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Abstract 

Bayesian inference for graphical models has received much attention in the litera¬ 
ture in recent years. It is well known that when the graph G is decomposable, Bayesian 
inference is significantly more tractable than in the general non-decomposable setting. 
Penalized likelihood inference on the other hand has made tremendous gains in the 
past few years in terms of scalability and tractability. Bayesian inference, however, 
has not had the same level of success, though a scalable Bayesian approach has its 
respective strengths, especially in terms of quantifying uncertainty. To address this 
gap, we propose a scalable and flexible novel Bayesian approach for estimation and 
model selection in Gaussian undirected graphical models. We first develop a class of 
generalized G-Wishart distributions with multiple shape parameters for an arbitrary 
underlying graph. This class contains the G-Wishart distribution as a special case. 
We then introduce the class of Generalized Bartlett (GB) graphs, and derive an effi¬ 
cient Gibbs sampling algorithm to obtain posterior draws from generalized G-Wishart 
distributions corresponding to a GB graph. The class of Generalized Bartlett graphs 
contains the class of decomposable graphs as a special case, but is substantially larger 
than the class of decomposable graphs. We proceed to derive theoretical properties 
of the proposed Gibbs sampler. We then demonstrate that the proposed Gibbs sam¬ 
pler is scalable to significantly higher dimensional problems as compared to using an 
accept-reject or a Metropolis-Hasting algorithm. Finally, we show the efficacy of the 
proposed approach on simulated and real data. 

Keywords: Gaussian graphical models, Gibbs sampler, Generalized Bartlett graph, 
Generalized G-Wishart distribution, Scalable Bayesian inference 
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1 Introduction 


Gaussian graphical models have found widespread use in many application areas. Besides 
standard penalized likelihood based approaches (see Khare et al. (2015) and references 
therein), Bayesian methods have also been proposed in the literature for analyzing undi¬ 
rected Gaussian graphical models (see Asci and Piccioni, 2007; Dawid and Lauritzen, 1993; 
Letac and Massarn, 2007; Mitsakakis et ah, 2011; Rajaratnam et al., 2008; Roverato, 2000, 
2002; Wang and Carvalho, 2010, to name just a few). Bayesian methods have the distinct 
and inherent advantage that they can incorporate prior information and yield a full poste¬ 
rior for the purposes of uncertainty quantification (and not just a point estimate), whereas 
standard frequentist approaches for uncertainty quantification (such as the bootstrap) may 
be computationally burdensome and/or break down in high dimensional settings. However, 
it is well known that Bayesian methods for graphical models in high dimensional settings lag 
severely behind their regularized likelihood based counterparts, in the sense that they are 
not scalable except under restrictive assumptions on the underlying sparsity pattern (such as 
for decomposable graphs). Hence a scalable and more general approach to graphical models, 
with theoretical and computational safeguards, is critical to leveraging the advantages of 
posterior inference. 

To outline the issues with current Bayesian methods more clearly, consider i.i.d. vectors 
Yx, Y 2 , • • • , Y„ drawn from a p -variate normal distribution with mean vector 0 and a sparse 
inverse covariance matrix fl The sparsity pattern in D can be encoded in terms of a graph 
G on the set of variables as follows. If the variables i and j do not share an edge in 
G, then = 0. Hence, an undirected (or concentration) graphical model corresponding 
to G restricts the inverse covariance matrix D to a submanifold of the cone of positive 
definite matrices (referred to as Pg). A Bayesian statistical analysis of these models requires 
specification of a prior distribution (supported on Pg) for D. Dawid and Lauritzen (1993) 
introduced a class of prior distributions for E = D _1 called the Hyper Inverse Wishart (HIW) 
distributions. The induced class of prior distributions for D (supported on Pg) is known as 
the class of G-Wishart distributions (see Roverato (2000)). This class of prior distributions 
is quite useful and popular, and has several desirable properties, including the fact that it 
corresponds to the Diaconis-Ylvisaker class of conjugate priors for the concentration graph 
model corresponding to the graph G. 

Closed form computations of relevant quantities corresponding to the G-Wishart distri¬ 
bution, such as expected value of the precision matrix and quantiles, are in general available 
only if the underlying graph G is decomposable, i.e., G does not have any induced cycle 
of length greater than or equal to 4. A variety of approaches have been developed in the 
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literature to generate samples from the G-Wishart distribution corresponding to a general 
non-decomposable graph. Asci and Piccioni (2007) have developed a maximal clique based 
Markov Chain Monte Carlo (MCMC) approach to sample from the G-Wishart distribution 
corresponding to a general graph G. Lenkoski (2013) develops a direct sampler for G-Wishart 
distributions corresponding to a general graph G. This approach uses an iterative algorithm 
to minimize an objective function over the space of positive definite matrices with appropriate 
sparsity constraints. Wang and Carvalho (2010) have developed an accept-reject algorithm 
to generate direct samples from the G-Wishart distribution corresponding to a general graph 
G. Mitsakakis et al. (2011) have developed a Metropolis-Hastings based MCMC approach 
for the same. 

While the G-Wishart prior is clearly very useful for Bayesian inference in graphical mod¬ 
els, it has an important drawback. In particular, the G-Wishart distribution has only one 
shape parameter, which makes it potentially inflexible and restrictive in terms of prior spec¬ 
ification. Letac and Massam (2007) address this issue by constructing the so-called Wp G 
and Wq g families of distributions which are flexible in the sense that they have multiple 
shape parameters. These distributions include the G-Wishart as a special case, and form a 
standard conjugate family of prior distributions for undirected decomposable graphical mod¬ 
els. The construction of the Letac and Massam distributions uses the structure associated 
with decomposable graphs. It would thus be useful to develop a class of prior distributions 
which is flexible (multiple shape parameters) and leads to tractable Bayesian inference for 
non-decomposable graphs. 

In this paper, we aim to develop a scalable and flexible Bayesian approach for estima¬ 
tion and model selection in Gaussian undirected graphical models for general graphs. Our 
approach preserves the attractive properties of previous approaches, while overcoming their 
drawbacks. We first develop a class of generalized G-Wishart distributions (for an arbitrary 
underlying graph), which has multiple shape parameters and contains the G-Wishart dis¬ 
tributions as a special case. These distributions form a family of standard conjugate prior 
distributions for Gaussian concentration graph models. Developing methods for efficient 
posterior draws from generalized G-Wishart distributions is crucial for scalable Bayesian in¬ 
ference. We proceed to introduce the class of Generalized Bartlett (GB) graphs, and derive 
an efficient Gibbs sampling algorithm (with Gaussian or GIG conditionals) to simulate from 
generalized G-Wishart distributions corresponding to a GB graph. The class of Generalized 
Bartlett graphs contains decomposable graphs as a special case, but is substantially larger 
than the class of decomposable graphs. For example, any cycle of length greater than 3 
is Generalized Bartlett, but is not decomposable. Our approach has the flexibility of using 
multiple shape parameters (as opposed to the single parameter G-Wishart), but goes beyond 
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the class of decomposable graphs without losing tractability. 

For the generalized G-Wishart case, the conditional densities for any maximal clique of 
hi are intractable to sample from. Hence, the sampling approaches in (Asci and Piccioni, 
2007; Lenkoski, 2013) for G-Wisharts on a general graph do not extend to the generalized 
G-Wishart. On the other hand, we show that the accept-reject and Metropolis-Hastings 
based methods in Wang and Carvalho (2010) and Mitsakakis et al. (2011) can be easily 
extended to the generalized G-Wishart case. We compare the performance and scalability 
of these two approaches with our Gibbs sampler in Section 7.2.1 and Section 7.2.2. 

The rest of the paper is organized as follows. Section 2 contains a brief overview of 
relevant concepts from graph theory and matrix theory. In Section 3 and Section 4, we 
define generalized G-Wishart distributions and GB graphs respectively, and establish some 
basic properties. In Section 5, we derive a tractable Gibbs sampling algorithm to simulate 
from the generalized G-Wishart distribution corresponding to a GB graph. Section 6 pro¬ 
vides additional examples and properties of GB graphs. Section 7 contains a comprehensive 
simulation and real data analysis study for the Bayesian approach developed in the paper. 
The proofs of most of the technical results in the paper and additional numerical work are 
provided in the Supplemental Document. 

2 Preliminaries 

2.1 Graph theoretic preliminaries 

For any positive integer p. let := {1,2, ••• ,p}. Let G = (V, E) denote an undirected 
graph, where V represents the finite vertex set and E C V x V denotes the corresponding 
edge set. A function o is defined to be an ordering of V if a is a bijection from V to Niyi. 
An undirected graph G = (V, E ) and an ordering a of V can be used to construct an ordered 
graph G a = (V,o,E a ), where (i,j) G E a if and only if (<r _1 (i), a -1 (j)) G E. 

Definition 1. An undirected graph G = (V, E) is called decomposable if it does not have 
a cycle of length greater than or equal to 4 as an induced subgraph. 

Such graphs are also called triangulated, or chordal graphs. A useful concept associated 
to decomposable graphs is that of a perfect elimination ordering (see Lauritzen (1996)). 

Definition 2. An ordering a for an undirected graph G = (V, E) is defined to be a perfect 
elimination ordering if for each j G IV|y|, the set {j } U {i : i > j, ( i,j ) G E a } forms a 
clique. 
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In fact, an undirected graph G is decomposable if and only if it has a perfect elimination 
ordering (see Paulsen et al. (1989)). 

Definition 3. For a given undirected graph G = (V,E), G = (V , E ) is called a decomposable 
cover of G if G is decomposable and E C E. 

Decomposable covers are also known as triangulations in graph theory literature (see Parraal 
and Schefflerb (1997)). 

2.2 Matrix theoretic preliminaries 

We denote the set of p x p symmetric matrices by M p , and the space of p x p positive definite 
symmetric matrices by M+. Given an ordered graph G a , we define 

P Gct = (D G M+! : Qij = 0 if (i,j) £ E a }, 

and 

C Ga = {L e M| V | : L u = 1, Lij = 0 for i < j or (i,j) £ E a }. 

The space P Gct is a submanifold of the space of \V\ x V" positive definite matrices, where 
the elements are restricted to be zero whenever the corresponding edge is missing from E a . 
Similarly the space £ Gct is a subspace of lower triangular matrices with diagonal entries 
equal to 1, such that the elements in the lower triangle are restricted to be zero whenever 
the corresponding edge is missing from E a . 

A positive definite matrix D can be uniquely expressed as D = LDL T , where L is a lower 
triangular matrix with diagonal entries equal to 1, and D is a diagonal matrix with positive 
diagonal entries. Such a decomposition is known as the modified Cholesky decomposition 
of fl (see for example Daniels and Pourahmadi (2002)). Paulsen et al. (1989) showed that 
if D G P g„i then L G £ G(T if and only if G is decomposable and a is a perfect elimination 
ordering. If either of these two conditions is violated, then the sparsity pattern in L is a strict 
subset of the sparsity pattern in D. The entries (i,j) ^ E a (with i > j ) such that L t j is not 
(functionally) zero, are known as “fill-in” entries. The problem of finding an ordering which 
minimizes the number of fill-in entries is well-known and well-studied in numerical analysis 
and in computer science/discrete mathematics. Although this problem is NP-hard, several 
effective greedy algorithms for reducing the number of fill-in entries have been developed and 
implemented in standard software such as MATLAB and R (see Davis (2006) for instance). 

In subsequent sections, we will consider a reparametrization from (the inverse covariance) 
matrix to its modified Cholesky decomposition. Such a reparametrization inherently 
assumes an ordering of the variables. In many applications (such as longitudinal data), a 
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natural ordering is available. In the absence of a natural ordering, one can choose a fill- 
reducing ordering using one of the available fill-reducing algorithms mentioned previously. 
We will see that a fill-reducing ordering will help in reducing the computational complexity 
of proposed Markov chain Monte Carlo procedures. 


2.3 Undirected graphical models and (G-Wishart distribution 

Let G = {V, E) be an undirected graph with V" = p, and a be an ordering of V. The 
undirected graphical model corresponding to the the ordered graph G CT is the family of 
distributions 

J = {MVN p (0,n~ 1 ) :QeP G J. 

Let Yf, Y 2 ,..., Y n be i.i.d. observations from a distribution in J. Note that the joint density 
of Yi, Y 2 ,...,Y n given C is given by 


(V2n) n P 


exp 


(-^tr(OS))- 


The G-Wishart distribution on P^g is a natural choice of prior for hi (see Dawid and Lauritzen 
(1993) and Roverato (2000)). The density of the G-Wishart distribution with parameters 
5 > 0 and U G M+ is proportional to 


\Q\ 2 exp ( — -tr(QU) 


Thus the posterior density of fl given Y 1} Y 2 ,... ,Y n is proportional to 

|n|^ exp ^—^tr(Q(nS + U))^, 


and corresponds to a G-Wishart distribution with parameters (n + 5) and (U + nS ), which 
implies that the family of G-Wishart priors are conjugate for the family of distributions J. 


3 Generalized G-Wishart distributions 

In this section we propose a generalization of the G-Wishart distribution that is endowed 
with multiple shape parameters, and contains the G-Wishart family as a special case. We 
shall show in later sections that the flexibility offered by the multiple shape parameters is 
very useful in high dimensional settings. 
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3.1 Definition 


We now define a multiple shape parameter generalization of the G-Wishart distribution for 
a general graph G. To do this, we transform the matrix 12 to its Cholesky decomposition. 
Consider the modified Cholesky decomposition 12 = LDL T , where L is a lower triangular 
matrix with diagonal entries equal to 1, and D is a diagonal matrix with positive diagonal en¬ 
tries. The (unnormalized) density of the generalized G-Wishart distribution with parameters 
<5 = (<5i, 62 , ■ ■ ■ , d p ) G M+ and U G M+ is given by 

7^(12) = fjj A*(12)^ j exp ^-^r(12f/)^. (3.1) 

We note that other generalizations of the Wishart have also been considered in Ben-David 
et al. (2015); Daniels and Pourahmadi (2002); Dawid and Lauritzen (1993); Khare and 
Rajaratnam (2011); Letac and Massam (2007). It is clear that the G-Wishart density arises 
as a special case of the generalized G-Wishart (by considering all the Si s to be equal and 
noting that |12| = nf=i Ai), and that the family of generalized G-Wishart distributions 
defined above is a conjugate family of prior distributions for undirected graphical models. 
In fact, the posterior density of 12 corresponds to a generalized G-Wishart distribution with 
parameters n + Si, ■ ■ ■ ,n + S p and (U + nS ). 

3.2 Some properties of the generalized G-Wishart distribution 

We now proceed to derive properties of the generalized G-Wishart distribution. To do so, 
we transform 12 to its modified Cholesky decomposition 12 = LDL T as defined in Section 
2.2. 

We define Lj = { L l} \i > j and ( i,j ) G E a } to be the set of functionally independent 
elements of L. Then the transformation 12 — > ( Lj , D ) is a Injection from to x 
with Jacobian equal to nj= =i where zq := |{* : i > j, (■ i,j ) G E a }\ for j = 1, 2,... ,p. 

Then the (unnormalized) generalized G-Wishart density for (. Lj,D ) is given by 

f P Sj+ 2 Vj \ / 1 P P min(ij) \ 

K,s( L h D ) = 2 ) x ex P L ikLjkD k (3.2) 

\i= 1 / \ i =1 3 =1 fc=l J 

We first establish sufficient conditions for the density tt^j S to be proper. 
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Theorem 1. If U is positive definite and <5* > 0 Vi = 1,... ,p, 


J nfi s (L I ,D)d{L I ,D) < oo. 

Also under these conditions, E [D t f < oo, Wi,j. 

The proof of Theorem 1 is provided in Supplemental Section A.l. Under the conditions in 
Theorem 1, we will refer to the normalized version of as nu,s- 

From Roverato (2000), if D follows G-Wishart with parameters (U,S) then for (i,j) G E 
or i = j, 

E((n-%) = If 

We now provide an extension of this result for the case of generalized G-Wishart distributions. 

Theorem 2. Let = LDL T G Pg ct be generalized G-Wishart with parameters (U,S) for 
some U G Mfi,8 G K+. Denote Q*. as the k x k principal submatrix of Q, and let [fl^ 1 ] 0 
denote the p xp matrix with Qf 1 as its appropriate kxk principal submatrix and rest of the 
elements equal to 0. Define the matrix Uq,, as {Uc a )ij = Uij x 1 {(ij)eE a }- If 8 k > 4, \/k, then 


E 


Y. (4 - 4+i) KT 

_ k<p 


U G „. 


The proof of Theorem 2 is quite detailed and technical and is thus provided in Supplemental 
Section A.2. Theorem 2 provides a useful tool to monitor convergence of any Markov chain 
Monte Carlo method for sampling from 7 tjj,s (and particularly the Gibbs sampler introduced 
in Section 5) . 

We also undertake a comparison between generalized G-Wishart distribution and the 
useful priors introduced by Letac and Massarn (2007) for the case of decomposable graphs. 
A careful analysis demonstrates that the generalized G-Wishart and Letac-Massam priors 
are quite different for decomposable graphs. The generalized G-Wishart coincides with the 
Letac-Massam Type II Wishart in the special case when G is homogeneous. See Supplemental 
Section B for details. 


4 Generalized Bartlett graphs 

As discussed earlier, the class of decomposable graphs is endowed with many properties help¬ 
ful for closed form computation of posterior quantities. The assumption of decomposability 
can be rather restrictive in higher dimensions, as they constitute a very small fraction of all 



graphs see (Figure 2a). We develop in this section a class of graphs, which is substantially 
larger than the class of decomposable graphs. We will show in later sections that for this 
class of graphs, we can generate posterior samples from the generalized G-Wishart, using a 
tractable Gibbs sampling algorithm. 

4.1 Preliminaries and Definitions 

We now provide the definition of Generalized Bartlett graphs. First consider the following 
procedure to obtain a decomposable cover (see Section 2.1) of an arbitrary ordered graph 
G a = (V,E a ). 

Algorithm 1 Triangulation Algorithm for an unordered graph G 

Denote Gq = G. 

while i < p — 2 do 

E? = U {(u,»)|ct(u) > tr( v) > i e N|v & (i>,ct _1 (*)) 6 Ef_ ,} 

G° = (V, Ef) 

end while 


We use the above algorithm to construct a decomposable cover for G as follows. Let D a (G) : = 
Gp_ 2 , and let D a (E) denote the edge set of D a (G). It follows by construction that the 
ordering a is a perfect vertex elimination scheme for D a (G). Hence, D a (G) is a decomposable 
cover for G. Note that, two different orderings may give rise to different decomposable covers. 
We now define Generalized Bartlett graphs. 

Definition 4. An undirected graph G = (V, E ) is said to be a Generalized Bartlett graph if 
there exists an ordering a, with the property that there does not exist vertices u,v,w G V 
satisfying (u,v), (v,w), ( u,w ) ^ E and (u,v), (v,w), ( u,w ) G D a (E). 

In such a case (i.e., when this property is satisfied), a is called a Generalized Bartlett ordering 
of G. When it exists, the Generalized Bartlett ordering may not be unique. The following 
lemma helps in proving an alternate characterization of Generalized Bartlett graphs, one 
that does not depend on any ordering of the vertices. The lemma shows that Algorithm 5 
leads to a collection of minimal decomposable covers, in the sense that the edge set of any 
decomposable cover of G has to contain D a (E) for some ordering a. 

Lemma 1. For any undirected graph G = (V, E) and a decomposable cover G = ( V,E ) of 
G, 3 an ordering a of V s.t., D a (E) C E. 

Proof. Let G = (V, E) be a decomposable cover of G — (V,E). Since G is decomposable 
let u be the perfect elimination ordering of it. We shall prove inductively that for i in 
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{0,1,... ,p — 2}, Ef c E. Since D a (E) = Ef_ 2 , that will prove this lemma. It is trivial to 
note that E = Ef C E. Let us assume that the claim holds for Ef_ v Consider any r > s > i 
such that (cr _1 (r), cr _1 (s)) G Ef \ Ef_ v Thus (a _1 (r), cr _1 (i)), (cr _1 (s), cr _1 (i)) G Ef_ x C E. 
Since a is the perfect elimination ordering for G, (cr _1 (r), cr _1 (s)) G E. Thus Ef C E which 
completes the induction step and proves the lemma. □ 

We now provide a second definition of Generalized Bartlett graphs. 

Lemma 2. An undirected graph G satisfies the Generalized Bartlett property if and only if 
G = (V, E) has a decomposable cover G = (V, E) such that every triangle in E contains an 
edge from E. That is for any u,v,w G V such that (u,v), (v,w), ( u,w ) G E, at least one of 
(u,v),(v,w),(u,w) belongs to E. 

Proof. If G satisfies the Generalized Bartlett property then by definition 3 an ordering a of 
V such that any triangle in D a (E) contains an edge from E. In that case we can simply 
take E = D a (E). On the other hand, let G = (V, E) be a decomposable cover of G such 
that every triangle in E contains an edge from E. By Lemma 1, 3 an ordering a of V, such 
that D a (E) C E. Thus any triangle in D a (E) is also a triangle in E and hence has an edge 
in E. This makes o the Generalized Bartlett ordering and G an Generalized Bartlett graph. 

□ 

In the subsequent arguments we will also refer to Generalized Bartlett graphs as satisfying 
the Generalized Bartlett property. Some common Generalized Bartlett graphs are: 

1. All decomposable graphs (follows by using G = G in Lemma 2). 

2. Any cycle (see Section 6.1 for a proof). This is the simplest example of a non- 
decomposable graph which satisfies the Generalized Bartlett property. Also note that 
4 cycle is the simplest non-decomposable graph. 

3. Any lattice with less than 4 rows or less than 4 columns (see Section 6.2 for a proof). 
Such graphs are useful in spatial applications (see Section 7.3). 

It is a natural question to ask how much larger the class of Generalized Bartlett graphs 
is compared to the class of decomposable graphs. It is quite difficult to obtain a closed 
form expression for the exact (or approximate) number of connected decomposable graphs 
(or Generalized Bartlett graphs) with a given number of vertices. However, a list of all 
possible connected non-isomorphic graphs having at most 10 vertices is available at http: 
//cs. £Lnu.edu.au/~bdm/data/graphs.html. Using this list, we computed the number of 
decomposable and Generalized Bartlett graphs with at most 10 vertices. Figure 2a provides 
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Figure 1: (left and middle) Examples of Generalized Bartlett graphs, (right) The bipartite 
graph iF 33 : the only 6-vertex connected graph which is not Generalized Bartlett. 

a graphical comparison of these proportions, and Table 3a gives the actual values of these 
proportions. It is quite clear that the proportion of Generalized Bartlett graphs is much 
larger than the proportion of decomposable graphs. As expected, the proportions of both 
classes of graphs decreases as the total number of vertices increases. However, the rate of 
decrease in the proportions is much larger for decomposable graphs. For example, less than 
0.02% of connected isomorphic graphs with 10 vertices is decomposable, but more than 85% 
of connected isomorphic graphs with 10 vertices is Generalized Bartlett. In this case the 
number of Generalized Bartlett graphs are approximately 100 times larger. 



Number of Vertices 

(a) Plot comparing the percentage 
of Generalized Bartlett graphs with 
decomposable graphs 


2 


5 


3k-4 


3k-l 


3k+2 
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(b) A 4 x 4 grid where the dotted lines 
represent the extra edges in its Gener¬ 
alized Bartlett cover. 


(a) Percentages for Generalized Bartlett graphs 
and decomposable graphs among connected non¬ 
isomorphic graphs with at most 10 vertices. 


Figure 3 


4.2 Clique Sum Property of Generalized Bartlett graphs 

An unordered graph G = ( V , E ) is said to have a decomposition into components G\ = 
(Vi ,Ei) and G 2 = (V 2 ,E 2 ) if the vertex set V can be decomposed as h = V\ U V 2 where 
(V\ — V 2 ) U (V 2 — Vi) % 0 such that the induced subgraph on V\ fl V 2 7 ^ 0 is complete, and 
Vi D V 2 separates V\ — V 2 from V 2 — V (i.e., if u e V\ — V 2 and w e V 2 — V\ then (u, w) f E). 
If G cannot be decomposed in the above manner it is called a prime graph. Hence any 
graph G is either prime or can be broken down into several prime components by repeated 
application of the above procedure. It is well known that a graph G is decomposable iff all of 
its prime components are complete. The following lemma provides a similar characterization 
for Generalized Bartlett graphs. 

Lemma 3. If all the prime components of a graph are Generalized Bartlett then the graph 
is also Generalized Bartlett. 

Proof. Note that, it is enough to prove the theorem, for a graph with two prime components. 
Let G = (V, E) be an undirected graph with V — V\ U V 2 such that induced subgraph on 
Vi fl V 2 7 ^ 0 is complete, and V\ fl V 2 separates V — V 2 and V 2 — V\. Let G\ = (V \, E\) 
and G 2 = ( V 2 ,E 2 ) be the corresponding induced subgraphs which are Generalized Bartlett. 
Then by Lemma 2, we can construct decomposable covers G\ = (Vi,Ei) and G 2 = (V 2 ,E 2 ) 
of G\ and G 2 respectively, such that every triangle in E j contains an edge in E t for i — 1,2. 
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Define E — E\ \J E 2 and G = (V, E). Note that G is a cover for G. We claim that G is 
decomposable. Suppose to the contrary, that for for some n> 4 there exists an induced cycle 
in G on a set of n vertices, say ui, 112 , ■ ■ ■, u n . Since G\ and G 2 are both decomposable, all of 
{ui,u 2 , . ■ ■ ,u n } cannot belong to exclusively in V\ or in V 2 . Hence, there exist 1 < i, j < n 
such that Mj e V \ — V 2 and Uj G V 2 — V\. Since V\ — V 2 and V 2 — V\ are separated by V\ fl V 2 , 
( Ui , Uj) ^ E. Thus the subgraph induced by G on the set of vertices {u\, u 2 , • • • , u n } contains 
two paths, both arising from u t and ending in Uj and intersecting no where in between. Let 
{ui, Ui t ,..., u ip , Uj} be one of those paths. If {ut, u ii: ..., u ip , Uj} C (V\ — V 2 ) U V 2 — Vi, then 
there exists points in V\ — V 2 and V 2 — V\ connected to each other in G, which is not possible. 
Thus 3uk € {ui x , ... ,Ui p } s.t. Uk G Vi fl V 2 . Similarly 3 Uk> G V\ fl V 2 corresponding to the 
second path. Since V\ fl V 2 is complete (uk,Uk>) G E. Hence (■ Uk,Uk>) is a chord in the cycle 
Mi, u 2 ,...., u n giving us a contradiction. Hence, G is a decomposable cover of G. 

To prove that G is a Generalized Bartlett graph we shall prove that every triangle in 
the decomposable cover G contains an edge in G. Let us assume to the contrary that for 
u,v,w G V, (u, v), (u, w), (■ v , w) G E but (u, v ), (u, w ), (v, w) ^ E. Since every triangle in G t 
has at least one edge from G, for z = 1,2, it follows that u, v, w all cannot belong exclusively 
to Vi or V 2 . Without loss of generality, let u G V\ — V 2 and v G V 2 — V\. This implies 
(u,v) ^ E giving us a contradiction. Thus G is Generalized Bartlett. □ 

4.3 Constructing Generalized Bartlett covers 

Given an ordered graph G a = (V, £3), Algorithm 2 below provides a Generalized Bartlett 
graph G a = (V, E a ) such that E a D E a . Such a graph is referred to as a Generalized Bartlett 
cover of G a . 

Algorithm 2 Construction of Generalized Bartlett cover 
Set E := E. 

while 3 i > j, (<r^ 1 (i), a -1 ^')) ^ E such that L,j when expressed as a polynomial violates 
Property A or B do 

end while 


Recall from Section 6.2 that a 4 x 4 grid is the smallest example of a grid which is not a 
Generalized Bartlett graph. The Generalized Bartlett cover for a 4 x 4 grid using Algorithm 
2 is provided in Figure 3b. Note that this cover has only three extra edges. 
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5 A tractable Gibbs sampler for generalized G-Wisharts 


In Section 4 we studied the graph theoretic properties of GB graphs. In this section we shall 
investigate the statistical/properties of GB graphs. In particular, we develop a tractable 
Gibbs sampling algorithm to sample from the generalized G'-Wishart distribution when the 
underlying graph is Generalized Bartlett. The first step in achieving this goal requires 
considering a further transformation of the Cholesky parameter (Lj, D) from Section 3.2. 


5.1 A reparametrization of the Cholesky parameter 

Let G = (V, E) be an undirected graph with V" — p, and a an ordering for G. Let 12 = LDL T 
be the modified Cholesky decomposition of 12 e Fey- To facilitate our analysis, we consider 
a one-to-one transformation of (Lj, D) defined as follows: 


(D\, D 2 , ..., D p ) — > (Di, D 2 , .. . , D p ) 

where D\ = D\ and Dk = ~j^~i f° r 2 < A; < p. The following lemma shows that terms of the 
form L^LjkDk can be expressed as a polynomial in the entries of Lj and D (with negative 
powers allowed for entries of D). 

Lemma 4. For 1 <k<j<i<p, terms of the form LikLjkDk, which appear in the 
modified Cholesky expansion of Q , are either functionally zero, or can be expressed as a sum 
of terms, where each term has the following form: 

±( n b-j x [nsj] (5.i) 

\{r>s,(r,s)^E,r<i,s<j} J \k =1 / 

Proof. Note that 

i 

Di = D k 

k= 1 

for all i, and the Jacobian of this transformation is flfc=i D k k - Hence, the posterior density 
of (Li, D) is proportional to 

( V ~ \ / j V V min(ij) k J\ 

II D°‘ x exp -- y y (nS a + U„) V L ik L jt [J D, , (5.2) 

3=1 J y i =1 j =1 k =1 1=1 J 

where a k = (p — k) + Y^=k n+Sl ^ 2ni for 1 < k < p. Note that if i > j and (i,j) f E, then 
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^ij = Yfk=i LikLjk.Dk = o, which implies 


Lij 


Et=i LaX 


ik lj jkXfc 


Dj 


3 - 1 3 


^ ikLjk -D; 

/c=l l=k +1 


Making repeated substitutions in the RHS of the above equation, it follows that the entry 
is either functionally zero, or can be expressed as a sum of terms, where each term looks 

like 

±( n rs-] x (n 5 m (5.3) 

y{r>s,(r,s)eE CT } J \k =1 / 

for suitable non-negative integers c rs , and non-positive integers dk . It is easy to see that 


c rs = 0 for (r, s) G E with r > i or s > j, 
dj = —1 and dk = 0 for k > j. 

Hence, L l} can be expressed as a polynomial in entries of Lj and D~ l . The results now 
follows from (5.3). □ 

Note that for every i > j with (i,j) £ E a , the functionally dependent entry L tJ can be 
expressed in terms of Lj and D as in (5.3). The above analysis indicates that, in general, 
the posterior is a complicated function of ( Lj,D). However, we will show that if G is a 
Generalized Bartlett graph, and cr is a Generalized Bartlett ordering for G, then the full 
conditional posterior distributions of all individual entries of (Lj, D) are either Gaussian 
or Generalized Inverse Gaussian distributions (and therefore easy to sample from). This 
property will then be used to derive a Gibbs sampling algorithm to sample from the posterior 
density in (5.2). 


5.2 The Gibbs sampler 

We now derive a Gibbs sampling algorithm to sample from the posterior density in (5.2). 
We start by defining two properties which will be crucial to the development of the Gibbs 
sampler. 

Definition 5. Let G a = (V, a, E a ) be an ordered graph, and for e P Q a , = LDL T be the 

modified Cholesky decomposition. 

1. The ordered graph G a is defined to have Property-A if for every i > j such that 
( i,j ) ^ E a , the following holds: for every r > s with ( r,s) G E a , L t] is a linear 
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function of L rs (keeping other entries of Lj and D fixed). In other words, in (5.3), c rs 
can only be 0 or 1 for every r > s with (r, s) E E a . 

2. The ordered graph G a is defined to have Property-B if for every i > j such that 
( i,j ) ^ E a , the following holds: for every 1 < k < p, L ^ is a linear function of Df 1 
(keeping other entries of Lj and D fixed). In other words, in (5.3), d k can only be 0 
or —1 for every 1 < k < p. 

We now state three lemmas which will be useful in our analysis. The first lemma pro¬ 
vides an equivalent formulation of Property-B. The proofs of these lemmas are provided in 
Supplemental Section A.4, A.5 and A.6 respectively. 

Lemma 5. The following statements are equivalent. 

(a) The ordered graph G a satisfies Property-B. 

(b) For every 1 <k<j<i<p, L ik Lj k D k can be expressed as a polynomial in entries of 
Lj and D (with negative powers allowed for entries of D). Furthermore for every term 
in the expansion of L ik Lj k D k , the power of any entry of D is either 0,1 or —1. 

Let i > j with (i, j) (f E a . As noted above, for kl < k < j, both L ik Lj k D k and L ik iLj k >D k > 
can be expressed as polynomials in entries of Lj and D (with negative powers allowed for 
entries of id). 

Lemma 6. Both L ik Lj k D k and L ik iLj k iD k i are either functionally zero, or any term in 
the expansion of L ik Lj k D k cannot be the exact negative (functionally) of any term in the 
expansion of L ik iLj k /D k /. 

The next lemma shows that Generalized Bartlett graphs satisfies Properties A and B. 

Lemma 7. Let G = (V, E) be a Generalized Bartlett graph, and a be a Generalized Bartlett 
ordering for G. Then, the ordered graph G a satisfies both Property A and B. 

We are now ready to state and prove the main result of this section, which provides a Gibbs 
sampler for the posterior density in (5.2). 

Theorem 3. Let G = (V, E) be a Generalized Bartlett Graph, and cr be a Generalized 
Bartlett ordering for G. Suppose e Pg ct follows a generalized G-Wishart distribution with 
parameters U and S for some positive definite matrix U and S > 0. If ft = LDL T is the 
modified Cholesky decomposition of ft, and we define D\ = D\, D k = for k > 2, then 

• the conditional posterior density of the independent entry Lij, given all other entries 
of Lj and D, is univariate normal, 
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• the conditional posterior density of D k given Lj and other entries of {D k >} k i^ k is either 
a Generalized Inverse Gaussian or Gamma. 

Proof. Note that for i > j, L{j can be expressed as a polynomial in the entries of Lj 
and D _1 . Recall that D a (G) = (V, D a (E)) is the decomposable cover obtained by the 
triangulation process described in Algorithm 5. We first establish that for i > j, L tJ is 
functionally non-zero iff (<x _ 1 (i), < 7 _ 1 (j)) G D a (E). We begin by noticing that at each step in 
the construction of G i, G 2 , ■ ■ ■, G p -1 we are adding some extra edges to 1 to get Gi , but 
never deducting anything. So if was an independent entry, i.e. (a _ 1 (i), <r - 1 (j)) G E 0 = E, 
then (cr _ 1 (i),cr _ 1 (j)) G E p _ 2 = D a (E). 

Now lets assume to the contrary that L t j is the first dependent but functionally non¬ 
zero entry s.t. (cr 1 (i) , cr _ 1 (j)) £ D a (E). Since L l3 is dependent but non-zero 3 k < j s.t. 
LikLjkjf: 7 ^ 0 appears in the expansion of L tJ . Now La- can be independent and hence 
(cr ^ 1 (z), a~ l {k)) G E C E k . Otherwise L ik is non-zero dependent. Since Ljj(the first non¬ 
zero dependent not in D a (E)) comes after L ik , (<r _ 1 (i), a~ l {k)) G D a (E). We recall that, 
for any l, while constructing Gi, we only join vertices higher than l. Thus (a~ 1 (i),cr^ 1 (k)) 
must have been joined before construction of G k , i.e. (<x _ 1 (i), a~ 1 (k)) G E k - 1 . By a similar 
argument a~ 1 (k)) G E k _ 1 . Thus (<x _ 1 (i), cr' 1 ^')) G E k C D a (E) and we have a 

contradiction. Hence, 

Uj± 0 => 

To prove the reverse implication we note that for r > s, (<7 -1 (r),<x” 1 (s)) G E = E 0 implies 
that L rs is independent and hence 7 ^ 0. Now we use induction and assume that the claim 
holds upto Ei_i. If for r > s > i, (cr _ 1 (r), cr _ 1 (s)) G E t \ E^_ 1 then (<x -1 (r), <x _1 (s)) E 
and hence L ri L si ^ appears in the expansion of L rs . Since (cr _ 1 (r), <x _1 (s)) G E t \ E^i, 
(<x _1 (r), a _1 (i)), (cr _1 (s), cr _1 (i)) G E^\ and by assumption L ri 0 and L si 0 which with 
the help of Lemma 6 implies L rs 0. Thus the assumption holds for E), which completes 
the induction step. 

It follows by (5.2) and Property-A that for every i > j, (■ i,j ) G E, the conditional posterior 
density of L tJ given all other entries of Lj and D is proportional to 

exp (-u,- b i:j ) 2 ), 

for appropriate constants and b tJ . Hence the conditional posterior density of is a 
Gaussian density. Similarly, it follows from (5.2) and Property-B that for every 1 < k < p, 
the conditional posterior density of D k given all entries of Lj and {D k >} k >^ k is proportional 
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to 



for appropriate constants a.k and bk■ Hence the conditional posterior density of D\ is Gamma, 
and for k > 2, the conditional posterior density of D is a Generalized Inverse Gaussian 
density. □ 

The results in Theorem 3 can be used to construct a Gibbs sampling algorithm, where the 
iterations involve sequentially sampling from the conditional densities of each element of 
(. Lj,D ). It is well known that the joint posterior density of (Lj,D) is invariant for the 
Gibbs transition density. Since the Gaussian density is supported on the entire real line, 
and the Generalized Inverse Gaussian density is supported on the entire positive real line, it 
follows that the Markov transition density of the Gibbs sampler is strictly positive. Hence, 
the corresponding Markov chain is aperiodic and A-irreducible where A is the Lebesgue 
measure on M) Li ^ x M+(Meyn and Tweedie (1993), Pg 87). Also, the existence of an invariant 
probability density together with A-irreducibility imply that the chain is positive Harris 
recurrent (see Asmussen and Glynn (2011) for instance). We formalize the convergence of our 
Gibbs sampler below. The following lemma on the convergence of the Gibbs sampling Markov 
chain facilitates computation of expected values for generalized G'-Wishart distributions. 

Lemma 8. Let G = (V, E) be a Generalized Bartlett graph, and a be a Generalized Bartlett 
ordering for G. Then, the Markov chain corresponding to the Gibbs sampling algorithm in 
Theorem 3 is positive Harris recurrent. 

5.3 Maximality of Generalized Bartlett graphs 

Note that the Gibbs sampling algorithm described in Theorem 3 is feasible only if Property- 
A and Property-B hold. The following theorem shows that if a graph is not Generalized 
Bartlett, then at least one of Property-A and Property-B does not hold. 

Theorem 4. If an ordered graph G a satisfies Property-A and Property-B, then the graph G 
is a Generalized Bartlett graph and a is a Generalized Bartlett ordering for G. 

Proof. Suppose there exists i > j > k s.t -(i,j),(i,k),(J,k) f E but (i,j),(i,k),(j,k) E 
D a (E) i.e. Lij ^ ^ 0, L jk ^ 0. Hence L^Lj^jp: is in the expansion of L iy The power 

of Dk in the expansion of L and Lj *. is —1. ^ = Df+\ • ■ ■ Dj 1 - Also we know that, no 

n D' 

term of LikLjkjy; can cancel with any term of L^Ljk'-^;. Hence in the expansion of L l} the 
power of Dk is —2. Thus Property-B is violated. The result now follows by Definition 7. 

□ 
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Theorem 4 demonstrates that the class of Generalized Bartlett graphs is maximal, in the 
sense that the conditional distributions considered in Theorem 3 are Gaussian/Generalized- 
Inverse-Gaussian only if the underlying graph is Generalized Bartlett. In other words the 
above tractability is lost for graphs outside the Generalized Bartlett class. 


5.4 Improving efficiency using decomposable subgraphs 

It is generally expected that ‘blocking’ or ‘grouping’ improves the speed of convergence of 
Gibbs samplers (see Liu et al. (1994)). Suppose fl = LDL 7 follows a generalized G'-Wishart 
distribution. In this section, we will show that under appropriate conditions, the conditional 
density of a block of variables in (L/, D) (given the other variables) is multivariate normal. 
Based on the discussion above, this result can be used to sample more efficiently from the 
joint density of (. Lj,D). 

Lemma 9. Let G a = ( V. , E a ) be a Generalized Bartlett graph withp vertices and fl(= LDL T ) 
follows generalized G-Wishart with parameters ( U , S). Suppose that for some 1 < pi < p, the 
induced subgraph of G a corresponding to the vertices {p\ + 1,... ,p} is decomposable with a 
perfect elimination ordering. Then {Lij\pi < j < i < p, ( i,j ) e E^^Lj \ {Lij\pi < j < i < 
p, ( i,j ) G E a },D) follows a multivariate normal distribution. 

Proof. We partition the matrix L as 


L = 



0 

L3 


where L\ has dimension pi x pi and correspondingly, 


U 


* u n, D 

U-2 U 3 ) 


Ei 0 \ 

0 D -2 J 


Note that the density of ( Lj , D ) is proportional to 


f\DT S ' l2 ew(-\tr(LDL T U) 


A sample calculation gives: 

tr{LDL T U) = tr(LiDiLlUi) + 2tr(L 1 D 1 L^C/ 2 ) + tr(L 2 DiL T 2 U 3 ) + tr(D 2 LjU 3 L 3 ). (5.4) 
Consider i > i! > pi such that (i, i') £ E a . Since the induced subgraph on {pi + 1, • • • ,p} is 
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a decomposable graph with a perfect elimination ordering, there does not exist Pi < j < %' 
such that (i, j), G E a . It follows that = — Yl]=i LijLi'jDj/Di' is a function of 

entries in (Li, L 2 , D). Hence, all the dependent entries in L 3 are functions of (Li, L 2 , D). It 
follows by ( 5 . 4 ) that given (Li, L 2 , D), tr ( LDL T U ) is a quadratic form in the independent 
entries of L 3 . Hence, the log of the conditional density of {L.^pi < j < i < p, ( i,j ) G E 
given the other entries in (Lj, D) is a quadratic form. This proves the required result. □ 

5.5 Closed form expressions for decomposable graphs 

A closed form expression for the mean of can be obtained if G is assumed to be a decom¬ 
posable graph. For {5; > 0|i = 1,... ,p} and U positive definite, the generalized G-Wishart 
density on Pg ct is, 

7 py,< 5 (n) oc JJ Dj 2 exp f—-tr(QU)\ where G Pg ct 
3 =1 ' ' ' 

Let us define, 

A f yj := {i : i > j, ( i,j) G E a } v 3 := \Af yj \ 

U) := {U t] \i G J\f yj } U yj := G J\T yj } 

Ci := -(U y >y l U y Cj := U 3j . - U yT (U yj )~ l U y > 0 

Also let H be the diagonal matrix with {k, /c)-th element as 5 k+ f k+2 and e is a p x p 
matrix whose (k,j )~th element is ekj if j G A f yk , is 1 if j = k and 0 otherwise. The following 
theorem provides closed form expectations of the elements of the matrix Q. 

Theorem 5. If G a is decomposable where the vertices have been ordered by an perfect elim¬ 
ination ordering, and = LDL T G Pg ct is generalized G-Wishart with parameters ( U,S ), 
then 

L j. |Dj ~ N ^ej, -— jy -— ^ and Dj ~ Gamma ^ J ^ J + 1 , ( 5 . 5 ) 

where Lj. are the independent entries ofthej-th column of L and D = diag{Di , D 2 ,..., D p ). 
Also, 

E(D) = Y^KU^)- 1 } 0 + e T He 

k<p 

The proof is provided in the Supplemental Section A.3. 
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6 Classes of Generalized Bartlett Graphs 


As mentioned earlier, the class of Generalized Bartlett graphs contains the class of de¬ 
composable graphs. In this section we will consider two naturally occurring examples of 
non-decomposable Generalized Bartlett graphs. We then provide schemes for combining a 
group of Generalized Bartlett graphs to produce a bigger Generalized Bartlett graph. 


6.1 The j9-cycle 

We show that the p-cyclc (with its standard ordering) satisfies Property-A and Property-B, 
and is hence a Generalized Bartlett graph. Let G a = (V,a,E a ), where V = N p , a is the 
identity permutation and 

Ea = ■ 1 < i,j <p, \i-j\ e (1,P — 1}}- 

The independent entries of L are L 2 1 , L 32 , ..., L p ( p _ p, L pi . After some straightforward alge¬ 
braic manipulations, the dependent entries of L can be calculated as follows: 


Lij — 0 


= (-i y-% 


pi 


[] L k ( k-i) 
\k=2 


Di 

D j 


if i p,i > j + 1 

if i = p and 2 < j < p — 2 


It is clear from the expressions in the above equation that Property-A and Property-B are 
satisfied and hence by Theorem 4, the p-cyclc is Generalized Bartlett. 

6.2 Grids 

A mxn grid is an undirected graph formed by the intersection of m rows and n columns where 
the vertices correspond to the p = mn intersection points and as a result m*(n—l)+n*(m—1 ) 
edges are formed. In this section we shall prove that for some particular ordering all n x 2 
and n x 3 grid are Generalized Bartlett. We order an n x 3 grid row wise starting from the 
top as shown in Figure 2b. 

Let G a = (V, E a ) be an ordered graph, 12 e Pcy, and 12 = LDL T denote the modified 
Cholesky decomposition of 12. Note that Property-A and Property-B have been defined for 
ordered graphs, but we extend these notions to polynomials as follows. For any polynomial 
p(Lj, D ) of (Lj, D), we say that p() satisfies Property-A if the power of any independent L tJ 
can be {0,1}. Similarly we say pQ satisfies Property-B if the power of any D k can be {—1, 0}. 
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We note that if j < i,i' and L VJ satisfies Property-B then also satisfies Property-B. 

Lemma 10. An n x 3 grid when ordered as above is Generalized Bartlett. 

6.3 Expansion property of Generalized Bartlett graphs 

In this section we develop two methods, which combine an arbitrary number of Generalized 
Bartlett graphs in a suitable manner to produce a larger Generalized Bartlett graph. 

6.3.1 Maximum vertex based expansion 

We start by proving a lemma which will be useful for further analysis. 

Lemma 11. Let G = (V, E) be a Generalized Bartlett graph. If V' is a subset of V and 
G' = (Vf E') is the corresponding subgraph then G' is also a Generalized Bartlett graph. 

Proof. Since G is Generalized Bartlett, let G be the decomposable cover of G such that 
any triangle in G has at least one edge in G. Let Gf be the induced subgraph of G for V'. 
Then Gf is decomposable since it is a induced subgraph of a decomposable graph. Also any 
triangle in Gf is a triangle in G and thus has atleast one edge in G and hence in G'. Thus by 
Lemma 2, G' is Generalized Bartlett. Moreover from the proof of Lemma 2 we can observe 
that if a is the Generalized Bartlett ordering for G then the same ordering works for G'. □ 

Let G = (V, E ) be a Generalized Bartlett graph with, V = {1,2,..., r}. Suppose we replace 
each vertex i of G, by a Generalized Bartlett graph Gi = (Vj, Ei), where for i — 1, 2 ,..., r, 
Vi = {pi-i + 1,... ,pi}. Here po = 0 and pi,P 2 , ■ ■ ■ ,Pr are the sizes of the r graphs. Note 
that the graphs being considered here are already ordered. For ease of exposition, we will 
suppress the ordered graph notation, and refer to the graphs as just G, G\, G 2 , .... 

Definition 6 . The expanded graph G = (V, E) is constructed from G using G 1 , G 2 , ■. ■, G r 
as follows, 

• V = Vi U V 2 U ... U V r = {1, 2,,.. ,p r } 

• ( k , /) G E iff either ( k , l) G E\ for some G t , or k = pi,l = pj for some 1 < i 7 ! j < r 
and ( i,j ) e E. 

Hence G is constructed from G by replacing the i-th vertex of G by Gi. An edge between i 
and j in G translates to an edge between the maximal vertices of Gi and Gj namely Pi and 
Pj. For any i — 1, 2,,.. ,p r , if pk-\ + 1 < i < Pk, the notation Graphii) shall denote Gk, and 
GraphJbefore(i ) shall denote U s <feG s = (U s <fcI4, U s< kE s ) 

Theorem 6 . The expanded graph G defined as above is Generalized Bartlett. 

The proof this theorem is given in the Supplemental Section A. 7. 
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6.3.2 Tree based expansion 


Consider a tree T with r vertices {vi,V 2 ,... ,v r }. For each Vi consider an arbitrary number 
of GB graphs say G $,..., G^ l \ We add an edge from Vi to each vertex in Gv} for every 
1 < j < rn . Denote the resulting graph by G = (V, E). Next the vertices in V are labeled in 
the following order. 


W n l) y—y (1) (~i( n 2 ) /^(l) fi(n r ) rp 

'“'U , ...,nr 1 ,cr 2 ,...,Ct 2 , . . . , CT r , . . . , Cr, r ,1 

The labeling is done in such a way that the induced ordering on each G^p is a GB ordering 
and every parent vertex in T gets a higher label than any of its children in T. Again for ease 
of exposition we will suppress the ordering notation and refer to the resulting ordered graph 
as G = (V,E). 

Theorem 7. The graph G defined and ordered as above is Generalized Bartlett. 

The proof of this theorem is provided in the Supplemental Section A.8. 

7 Illustrations and Applications 

We now illustrate the advantages of our Generalized Bartlett approach on both simulated and 
real data and demonstrate that the proposed GB method is scalable to significantly higher 
dimensions. In Section 7.1, we illustrate the advantage of having multiple shape parameters 
in the generalized G-Wishart distribution. In Section 7.2.1 and Section 7.2.2, we undertake 
a comparison of our algorithm with the accept-reject and Metropolis-Hastings approaches. 
Section 7.3 contains a real data analysis using data from a temperature study. Although 
the main focus of this paper is development of the flexible class of G-Wishart distributions, 
and tractable methods to sample from these distributions, we also illustrate that the meth¬ 
ods developed in this paper can be used for high-dimensional graphical model selection in 
conjunction with existing penalized likelihood methods (see Supplemental Sections C and 
D). 

7.1 Comparing C-Wishart with generalized C-Wisharts 

In this section, we present a simulation experiment to demonstrate that the multiple shape 
parameters in the generalized G-Wishart distribution can yield differential shrinkage and 
improved estimation as compared to the single parameter G-Wishart in higher dimensional 
setting. 
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Figure 4: Performance of G'-Wishart prior for different values of 5. The x-axis represents the 
chosen value of S. The y-axis represents Stein’s loss between the estimated posterior and the 
true value for and E respectively. 


For the purposes of this experiment we consider a Generalized Bartlett graph G with 
p = 1000 vertices, defined as follows. Let, 

b x = 50, b 2 = 150, b 3 = 450, b 4 = 1000 

Bi = {1,.. •, 49}, B 2 = (51,..., 149}, B 3 = (151,..., 449}, B A = (451,..., 999} 

A graph G is constructed by forming the 4-cyclc {bi,b 2 ,b 3 ,b±} and then connecting b t with 
all elements of Bi for i — 1,2, 3,4. An inverse covariance matrix O 0 e P G is then constructed 
by taking O 0 = L 0 D 0 L g, where ( D 0 )jj = A — bi -1 if j G {&.;} U B % . Here L 0 is a lower 
triangular matrix with independent entries equal to 0.5, and dependent entries chosen such 
that Q 0 £ Pg- 

We then generate n = 100 samples from a N(0, Eo = ^o 1 ) distribution. Let S denote 
the corresponding sample covariance matrix. Let c denote the mean of the diagonal entries 
of n * S. We first consider a G-Wishart prior for H with U = cl p and different choices of 5. 
Using the Gibbs sampler proposed in Section 5, the posterior mean for H (and E) is then 
computed for each choice of <5. Figure 4 depicts the performance of these posterior mean 
estimators in terms of the Steins loss function (denote by L\). It can be seen from Figure 4 
that, Li(Cl, fl 0 ) and Li(S, So) are minimized at S = 262 and 5 = 353 respectively. 

We now illustrate the improved performance of the posterior mean estimators when using 
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our generalized G-Wishart priors endowed with multiple shape parameters. If 12 = E^ 1 
follows a G-Wishart distribution with parameters U and 5 > 0, then for every i = j and 
(i,j) G E, E[T,ij\ = Uij/8 see (see Roverato, 2002, Corollary 2). Borrowing intuition from 
this result, we first choose a generalized G-Wishart empirical prior for 12 with U = cl p and 
S = diag(U +nS)/diag(S). Here diag(-) denotes the vector of diagonal entries of a given 
matrix. It can be seen from Table 1 that even with this empirical choice of <5 we observe 
a 30.2% decrease in Stein’s loss for 12 and 13% for E compared to the best performance in 
the single shape parameter case. Next, we perform a restricted grid search to check if the 
performance can be further improved. In particular, a 4-dimensional grid search is performed 
on (mi, m 2 , m 3 , 777.4) where for we assign 81 = rrii for all I G B; U {&;}. As shown in Table 1, 
the best posterior mean estimator obtained via this search improves the Stein’s loss for 12 and 
E by 35% and 27.7% respectively (compared to the best estimator in the single parameter 
case). 


8 

Li(12,12 0 ) 

Li(E,E 0 ) 

262 * l p 

161.9 

207.6 

353 * l p 

222.4 

158.3 

diag(U*) / diag(S) 

113.0 

137.7 

mi = 140, m 2 = 140, m 3 = 264, m 4 = 348 

105.2 

114.4 


Table 1: The first two rows indicate the best possible performance using single shape pa¬ 
rameter. The next two rows indicate the performance for an objective choices with multiple 
shape parameters and an attractive choices obtained by doing a grid search over certain 
4-valued 8 vector. 


7.2 Comparison with other Monte Carlo based approaches 

We shall show in this section that two other approaches, namely the accept-reject algorithm 
and the Metropolis algorithm, can also be used to sample from the generalized G-Wishart 
distribution. We demonstrate however that both these algorithms can have specific scalabil¬ 
ity issues, but the proposed Gibbs sampler can overcome these challenges. 

7.2.1 Comparison with the accept-reject algorithm 

A useful accept-reject algorithm to simulate 12 from a G-Wishart distribution for a general 
graph G is provided in Wang and Carvalho (2010). This algorithm can be easily generalized 
to simulate from our generalized G-Wishart distributions. Below, we present this generalized 
version of the accept-reject algorithm for a graph G, where G cannot be decomposed into 


25 




any prime components. Many graphs, such as cycles, m x n grids (with m, n > 3) satisfy 
this property. 

Let G Pg ct follow a generalized G-Wishart distribution with parameters U and S for 
some positive dehnite matrix U and S G M+. Let C/ _1 = T'T be the Cholesky decompositions 
of U~ l . Also for i = {l,...,p}, define := |{s|s > i, ( s,i ) G E a }\. The accept-reject 
algorithm can now be specified as follows. 


Step 1 Simulate T as follows; For i G cr(V), ~ Xs-+w +2 an d for * < j, (hj) G E a , 
ipij ~ iV(0,1). For i < j, ( i,j) £ E a , we calculate ipij as, 

3 - 1 

fpij = ~ Y. ipikTkj/Tjj if i = 1 

k =i 




k=i 


k =1 




if 1 < i < j 


Step 2 Simulate u ~ t/(0,1) and check whether 


u < exp j J] i’l 

\ *<J, {i,j)£Ec 7 

If this holds, then accept this value of T, else go to Step 1. 

Step 3 Set <F = TT and hi = T'T.Then hi has the required generalized Wishart distribution. 

A common problem with the vanilla application of accept-reject algorithm even in moderate 
dimensional settings is that the average acceptance probability can be extremely small. This 
issue can make the accept-reject algorithm computationally infeasible. We find that the 
same phenomenon happens with the accept-reject algorithm in the generalized G-Wishart 
distribution setting. The algorithm works well for small dimensional examples, such as the 
simulation example in Wang and Carvalho (2010) for a 7-vertex graph (where the largest 
prime component has order 4). We find however, that the low acceptance probability issue 
mentioned above, surfaces as we increase the size of the largest prime component. To illus¬ 
trate this, let G be a 12 cycle (which cannot be decomposed into prime components) with 
an ordering a as specified in Section 4. Consider a generalized G-Wishart distribution with 
parameters U° and <5. Here Lg = 100 and for |i — j\ — 1,11 Lg = 40 and 0 otherwise. The 
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i 

j 

Simulated mean 

True Mean 

i 

j 

Simulated mean 

True Mean 

1 

1 

100.6 

100 

2 

1 

39.7 

40 

2 

2 

98.99 

100 

3 

2 

40.14 

40 

3 

3 

100.4 

100 

4 

3 

40.25 

40 

4 

4 

100.3 

100 

5 

4 

39.95 

40 

5 

5 

100 

100 

6 

5 

39.79 

40 

6 

6 

99.82 

100 

7 

6 

39.75 

40 

7 

7 

99.3 

100 

8 

7 

40.23 

40 

8 

8 

100.6 

100 

9 

8 

39.97 

40 

9 

9 

100.1 

100 

10 

9 

39.9 

40 

10 

10 

99.73 

100 

11 

10 

39.87 

40 

11 

11 

99.86 

100 

12 

1 

40.01 

40 

12 

11 

39.98 

40 

12 

12 

99.95 

100 


Table 2: Comparison between simulated mean of £*■ and its theoretical mean 


shape parameters, 5, = 60 for 1 < i < 6, and 5* = 70 for 7 < i < 12. The average time taken 
by the accept-reject algorithm to complete one iteration is more than 5 hours on a 2.4 Ghz 
processor with 4 GB RAM. Clearly, this happens due to low acceptance probabilities. How¬ 
ever, the Gibbs sampler does not suffer from these issues, since no acceptance/rejection step 
is involved. The results obtained by using 10000 iterations of the Gibbs sampler (which take 
approximately 4 minutes) are provided in Table 2. In particular, this table demonstrates 
that the difference between mean of £*■ from the Gibbs sampler (for i = j or (i,j) G E a ) 
and its theoretical expectation U°[i,j\ (as given by Theorem 2) is very small. 

7.2.2 Comparison with the Metropolis-Hastings Algorithm 

A useful Metropolis-Hastings algorithm to sample from the G-Wishart distribution has been 
developed in Mitsakakis et al. (2011). This approach can also be conveniently adapted to 
the setting of our generalized G-Wishart distribution. Suppose we want to simulate from 
a generalized G-Wishart distribution with parameters U and <fi, • • • , 5 P > 0 corresponding to 
an undirected graph G = (V,E), with a specified ordering cr. Let Q = and U = T'T 
be the Cholesky decompositions of Q and U~ l . Mitsakakis et al. (2011) propose the following 
algorithm to simulate T = c&T -1 , and thereby fl from the required distribution. 

For i < j and (i,j) ^ E a they call 4/^ to be a ‘dependent’ entry, while T/ = : 

% = j or i < j and (i,j) G E a } are referred to as the independent entries of T. Also for 
1 < j < p, they define Uj := |{s|s > j, ( s,j ) G E a }\. Given the independent entries, the 
dependent entries can be calculated exactly as in Section 7.2.1. Let h(T/) = nf=i XSi+vt x 
N\E a \ (0|e ct |, I\E<r\) denote the distribution on T/, where for % = 1,... ,p, ~ Xs + Vi an d for 

i < j, ( i,j) G E a , Ty are independent standard normal. For a given positive integer N, the 
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procedure to generate N iterations of the Metropolis-Hastings algorithm is given as follows. 


• Initialize v h^ 0) by sampling Ty 0) from h(-) and set 

• For i in 1, 2,..., N do :: 

1. Sample '^ prop from h(-). 

2. Set logo = | ZvmeMW) 2 - (*r) 2 >- 

3. Sample b from Bernoulli(mm(a!, 1)). If b — 1, set = Ty' op , else set = '3/f w \ 




Figure 5: Plots comparing the maximum entry wise difference and the maximum entry 
wise relative difference between the estimated and true expected values for the Metropolis- 
Hastings algorithm versus the Gibbs sampling algorithm. 


In the above algorithm at each stage the acceptance probability depends on a, which 
then depends on the dependent entries of cur and T prop . The dependent entries in turn 
depend on the matrix T via terms of the form j 3 - for 1 < i < j < p. If the terms ^ are large 
in magnitude then we expect the terms j)^e (T (^ / p r ) 2 an< i J2(i j)^E a (^ij° P ^ 2 l ar g e 

in magnitude as well. This typically makes logo; either a large positive number or a large 
negative number which makes the acceptance probability close to 0 or 1, thereby making the 
process potentially expensive in terms of timing. To illustrate this fact lets consider the 5x3 
grid (which satisfies the Generalized Bartlett property) and take all values of {tp : i < j} 
equal to A for some A > 0 and the diagonal entries of T as 1. We illustrate below that as the 
value of A increases, the performance of the Metropolis-Hastings algorithm can deteriorate. 
In comparison, this change in A has negligible effect on the performance of the proposed 


















Gibbs sampling algorithm. If fl = EG 1 is a generalized G-Wishart with parameters U and 
<5, then for i = j or (i,j) G E a , the expected value of E*- is Uij (see Theorem 2). Thus, 
to compare the performance of the two algorithms, we check the difference between the 
estimated values of E* and U (the independent parts only) using the sup norm and also the 
relative error. Figure 5 shows the comparison between the Gibbs algorithm and the MH 
algorithm (both of which are algorithms for the generalized Wishart class) for varying values 
of A (choosing the entries of S chosen uniformly on a grid from 70 to 100 for all cases). The 
running time for both algorithms for each A is approximately 10 mins on an AMD-V 2.4 
GHz processor. 

7.3 Application to temperature data 

In this section, we provide an illustration of our methods on the (Brohan et al. (2006)) 
dataset. HadCRUT3 dataset consists of monthly temperature data provided on a grid over 
the globe starting from 1850 till 2012. The spatial resolution is at a 5° latitude and 5° 
longitude. Here, we consider 28 locations from the US map (out 1732 locations worldwide) 
as shown in Figure 6. Thus we have n = 157 samples for p = 28 variables. Our goal is to 
estimate the precision matrix f! of these 28 temperature variables. Given the spatial nature 
of the data, it is natural to impose sparsity on the precision matrix, with the underlying 
graph G as shown in Figure 6. We use an ordering a of the variables specified as follows: 
label the vertex in the bottom of the leftmost column of the grid as 1, and then move up 
the columns from south to north, and the rows from east to west. We proceed to fit a 
concentration graph model, which assumes that fl G Pg ct , with G and a defined as above. 




Figure 6: (left) 28 US grid locations for the HadCRUT3 temperature data, (right) The 
underlying graph G for the concentration graph model. 


The graph G is not decomposable, but can be shown to be a Generalized Bartlett graph 
(it is an induced subgraph of an 11 x 3 grid). Hence, the Bayesian framework developed in 
this paper can be used to obtain an estimate of fl. We use two different empirical/objective 
priors for our analysis. In particular, our first choice is a generalized Wishart prior with 
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Table 3: Posterior expectations for the two priors and mle estimate of the precision matrix 
for the temperature data in Section 7.3. 


(iJ) 

Bayes 1 

Bayes 2 

Glasso 

(2,1) 

-2.52 

-2.76 

-2.59 

(4,1) 

-1.51 

-1.64 

-1.52 

(4,3) 

-1.20 

-1.25 

-1.17 

(5,2) 

-2.75 

-2.86 

-2.81 

(5,4) 

-1.32 

-1.40 

-1.35 

(6,3) 

-2.37 

-2.40 

-2.38 

(7,4) 

-3.37 

-3.57 

-3.53 

(7,6) 

-0.72 

-0.82 

-0.71 

(8,5) 

-3.04 

-3.20 

-3.08 

(8,7) 

1.12 

1.21 

1.18 

(9,6) 

-3.58 

-4.01 

-3.73 

(10,7) 

-1.23 

-1.34 

-1.26 

(10,9) 

-1.44 

-1.39 

-1.37 

(11,8) 

-5.38 

-6.26 

-5.82 

(11,10) 

-1.52 

-1.66 

-1.54 

(12,9) 

-1.13 

-1.10 

-1.05 

(13,10) 

-2.21 

-2.44 

-2.29 

(13,12) 

-2.62 

-2.83 

-2.65 

(14,11) 

-0.49 

-0.55 

-0.50 

(14,13) 

-0.38 

-0.43 

-0.37 

(15,12) 

-3.19 

-3.44 

-3.25 

(16,13) 

-0.48 

-0.54 

-0.45 

(16,15) 

-0.67 

-0.75 

-0.67 

(17,14) 

-0.42 

-0.40 

-0.40 


(id) 

Bayes 1 

Bayes 2 

Glasso 

(17,16) 

-0.10 

-0.09 

-0.07 

(18,15) 

-2.48 

-2.83 

-2.57 

(19,16) 

-0.68 

-0.71 

-0.66 

(19,18) 

-1.46 

-1.50 

-1.43 

(20,17) 

-3.19 

-3.50 

-3.36 

(20,19) 

-2.45 

-2.89 

-2.54 

(21,18) 

-2.87 

-3.01 

-2.96 

(22,19) 

-1.88 

-2.10 

-1.83 

(22,21) 

-2.37 

-2.63 

-2.40 

(23,20) 

-3.27 

-3.62 

-3.40 

(23,22) 

-2.18 

-2.29 

-2.20 

(24,21) 

-0.84 

-0.96 

-0.84 

(25,22) 

-5.16 

-5.34 

-5.36 

(25,24) 

-0.71 

-0.66 

-0.70 

(26,24) 

-0.68 

-0.69 

-0.66 

(27,25) 

-7.53 

-9.25 

-8.19 

(27,26) 

-1.65 

-1.75 

-1.68 

(28,27) 

-3.77 

-4.25 

-3.87 

(1,1) 

4.62 

5.04 

4.65 

(2,2) 

7.26 

7.65 

7.39 

(3,3) 

4.88 

4.99 

4.80 

(4,4) 

6.79 

7.22 

6.98 

(5,5) 

6.77 

7.12 

6.84 


(iJ) 

Bayes 1 

Bayes 2 

Glasso 

(6,6) 

8.40 

9.25 

8.53 

(7,7) 

4.15 

4.46 

4.26 

(8,8) 

8.65 

9.90 

9.20 

(9,9) 

6.41 

6.54 

6.26 

(10,10) 

6.15 

6.60 

6.18 

(11,11) 

6.54 

7.47 

6.93 

(12,12) 

8.01 

8.52 

8.02 

(13,13) 

6.31 

7.04 

6.33 

(14,14) 

0.77 

0.79 

0.75 

(15,15) 

6.74 

7.55 

6.89 

(16,16) 

1.23 

1.32 

1.18 

(17,17) 

3.53 

3.79 

3.61 

(18,18) 

6.05 

6.45 

6.14 

(19,19) 

6.54 

7.31 

6.52 

(20,20) 

9.91 

11.10 

10.34 

(21,21) 

7.23 

7.89 

7.36 

(22,22) 

10.81 

11.47 

10.99 

(23,23) 

5.03 

5.46 

5.14 

(24,24) 

2.31 

2.36 

2.27 

(25,25) 

14.69 

16.93 

15.62 

(26,26) 

4.71 

4.99 

4.72 

(27,27) 

11.53 

13.52 

12.20 

(28,28) 

5.46 

6.04 

5.55 


scale parameter U 1 = J 2 s and shape parameter 5 1 with = l/Sa for 1 < i < 28. As a 
second choice, we use a generalized Wishart prior with scale parameter U 2 = J 2 8 and shape 
parameter 5 2 with 5 2 = ( S~ l )u for 1 < % < p. The Gibbs sampling procedure specified in 
Section 5.2 was used to generate samples from the two corresponding posterior distributions. 
The burn-in period was chosen to be 2000 iterations, and the subsequent 1000 iterations were 
used to compute the posterior means and credible intervals. Increasing the burn-in to more 
than 2000 iterations did not lead to significant changes in the estimates, thus indicating 
that the chosen burn-in period is appropriate. The posterior mean estimates for both the 
priors are provided in Table 3. The MLE for e P c a was also computed using the glasso 
function in R , and is provided in Table 3 as well. As noted in the introduction, an inherent 
advantage of Bayesian methods is ability to easily provide uncertainty quantification using 
the posterior distribution. The estimated 95% posterior credible intervals for both prior 
choices are provided in Table 4. 
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Table 4: 95% credible interval for elements of the precision matrix corresponding to the two 
priors for the temperature data in Section 7.3. 



95% Cl for the posterior mean | 

(iJ) 

Bayes 1 

Bayes 2 

(2,1) 

(-2.59,-2.46) 

(-2.83,-2.69) 

(4,1) 

(-1.56,-1.47) 

(-1.69,-1.59) 

(4,3) 

(-1.24,-1.16) 

(-1.29,-1.20) 

(5,2) 

(-2.82,-2.68) 

(-2.92,-2.79) 

(5,4) 

(-1.36,-1.28) 

(-1.44,-1.36) 

(6,3) 

(-2.43,-2.30) 

(-2.46,-2.33) 

(7,4) 

(-3.43,-3.31) 

(-3.64,-3.50) 

(7,6) 

(-0.76,-0.68) 

(-0.86,-0.78) 

(8,5) 

(-3.10,-2.98) 

(-3.26,-3.15) 

(8,7) 

(1.10,1.15) 

(1.18,1.24) 

(9,6) 

(-3.67,-3.50) 

(-4.09,-3.92) 

(10,7) 

(-1.27,-1.20) 

(-1.37,-1.30) 

(10,9) 

(-1.49,-1.39) 

(-1.44,-1.33) 

(11,8) 

(-5.47,-5.28) 

(-6.37,-6.16) 

(11,10) 

(-1.56,-1.48) 

(-1.69,-1.62) 

(12,9) 

(-1.18,-1.08) 

(-1.15,-1.05) 

(13,10) 

(-2.27,-2.16) 

(-2.50,-2.38) 

(13,12) 

(-2.69,-2.55) 

(-2.90,-2.75) 

(14,11) 

(-0.51.-0.48) 

(-0.57,-0.54) 

(14,13) 

(-0.40,-0.37) 

(-0.45,-0.41) 

(15,12) 

(-3.26,-3.12) 

(-3.52,-3.37) 

(16,13) 

(-0.50,-0.45) 

(-0.57,-0.52) 

(16,15) 

(-0.70,-0.64) 

(-0.78,-0.72) 

(17,14) 

(-0.43,-0.40) 

(-0.41,-0.39) 



95% Cl for the posterior mean 1 

(ij) 

Bayes 1 

Bayes 2 

(6,6) 

(8.26,8.54) 

(9.12,9.38) 

(7,7) 

(4.09,4.21) 

(4.39,4.53) 

(8,8) 

(8.52,8.78) 

(9.76,10.04) 

(9,9) 

(6.31,6.51) 

(6.43,6.64) 

(10,10) 

(6.05,6.24) 

(6.51,6.69) 

(11,11) 

(6.44,6.64) 

(7.37,7.57) 

(12,12) 

(7.89,8.14) 

(8.40,8.65) 

(13,13) 

(6.22,6.40) 

(6.94,7.13) 

(14,14) 

(0.75,0.78) 

(0.78,0.80) 

(15,15) 

(6.64,6.84) 

(7.45,7.65) 

(16,16) 

(1.21,1.25) 

(1.30,1.34) 

(17,17) 

(3.47,3.59) 

(3.73,3.84) 

(18,18) 

(5.96,6.13) 

(6.37,6.53) 

(19,19) 

(6.45,6.64) 

(7.21,7.41) 

(20,20) 

(9.78,10.05) 

(10.96,11.24) 

(21,21) 

(7.11,7.36) 

(7.76,8.02) 

(22,22) 

(10.67,10.96) 

(11.32,11.62) 

(23,23) 

(4.95,5.12) 

(5.38,5.55) 

(24,24) 

(2.27,2.35) 

(2.32,2.40) 

(25,25) 

(14.48,14.89) 

(16.71,17.15) 

(26,26) 

(4.63,4.80) 

(4.90,5.07) 

(27,27) 

(11.37,11.70) 

(13.34,13.70) 

(28,28) 

(5.37,5.56) 

(5.94,6.14) 



95% Cl for the posterior mean 1 

(ij) 

Bayes 1 

Bayes 2 

(17,16) 

(-0.11,-0.08) 

(-0.11,-0.08) 

(18,15) 

(-2.54,-2.42) 

(-2.89,-2.77) 

(19,16) 

(-0.70,-0.65) 

(-0.74,-0.68) 

(19,18) 

(-1.51.-1.41) 

(-1.55,-1.45) 

(20,17) 

(-3.26,-3.12) 

(-3.56,-3.43) 

(20,19) 

(-2.51.-2.38) 

(-2.96,-2.82) 

(21,18) 

(-2.94,-2.80) 

(-3.08,-2.94) 

(22,19) 

(-1.95,-1.82) 

(-2.17,-2.03) 

(22,21) 

(-2.44,-2.30) 

(-2.70,-2.55) 

(23,20) 

(-3.34,-3.20) 

(-3.69,-3.54) 

(23,22) 

(-2.24,-2.12) 

(-2.35,-2.23) 

(24,21) 

(-0.88,-0.81) 

(-0.99,-0.92) 

(25,22) 

(-5.26,-5.06) 

(-5.45,-5.23) 

(25,24) 

(-0.75,-0.67) 

(-0.71,-0.62) 

(26,24) 

(-0.72,-0.64) 

(-0.73,-0.65) 

(27,25) 

(-7.67,-7.39) 

(-9.41,-9.09) 

(27,26) 

(-1.70,-1.60) 

(-1.81,-1.70) 

(28,27) 

(-3.85,-3.70) 

(-4.33,-4.16) 

(1,1) 

(4.54,4.70) 

(4.97,5.12) 

(2,2) 

(7.14,7.38) 

(7.53,7.77) 

(3,3) 

(4.81,4.96) 

(4.92,5.05) 

(4,4) 

(6.70,6.88) 

(7.11,7.33) 

(5,5) 

(6.67,6.87) 

(7.02,7.21) 
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Supplemental Document 


A Proofs 

A.l Proof of Theorem 1 

Note that, 


tr ( LDL t U ) = tr ( DL T UL ) 

p 

= DjL T jULj [. L.j is the j'-tfi column of L] 

3 = 1 


where Lj is the j-th column of L. Since U is positive definite, let A > 0 be its minimum 
eigenvalue. Hence, 

tr(LDL T U) > xf^DjL^Lj > A (o, + 

3 = 1 i =1 


where Lj. denote the vector of independent entries of L j and the length of Lj- is Vj. Hence, 
there exists c\ > 0 such that, 


7T, 


P S, 


u, 


■,s(Li,D) < ci JJd/ Uj exp 
j =i 


AD, 


exp 


AD 


Lr 


2 


Now, 


and we know that 


exp 


' Lj . £l J 
1 3 


AD 


j LjL I . ) dLr = (2vr)^A-^D 


2 3 


v i v-i _ j_ 

\--T- n 2 
3 


Dj 2 J 2 exp 


AD, 


cZD, 


D 2 exp 


AD, 


dD, 


< oo, 


since > 0. Hence can be normalized to a proper density 7Tj7 )( $. Next we prove that 
Vi > j, Liij has finite expectation under this density. Since L ik Lj k D k , it is enough 


1 



to show that V7c < j, the expectation of L ik Lj k D k exists. Let us consider the following cases, 
Case 1 i > j > k 


LikL jkD k 


(Li k \/D k ) (Ljk \/-Dfc) 


Case 2 i = j > k 


Case 3 i — j — k 


LikLjkDk 


(\L ik \y/W k ) 2 


LikLjkDk 


Dk 


Case 4 i > j = k 


LikLjkDk 


yj L)k (Lik y/ Dk) 


It follows from equation 3.2, that 


\L ik L jk Dk\7Tu :S (L I ,D) < ci\L ik L jk Dk\ D t 2 +Ul exp ex P ^ ^ 

I t t n I TT ( ( A Di t 

= \L ik L jk D k \ Y Y exp I- — I exp I- —L t L,i 

x Ci n^exp(-^)exp(-^LlL, 


-L^L, 


Note that both a: 2 exp and |x| exp are uniformly bounded above in x. It 

follows by (*) that in all the cases considered above, \L ik Lj k D k \ exp (—exp (— ^^L T k L,ij 
is uniformly bounded in (Lj,D). Since we have already established that 


n 

3 =1 


D f+, 


exp 


A D, 


exp 


A D 


J L t ,L 


is integrable, it follows that L ik Lj k D k has finite expectation under ttuj- 


2 



A.2 Proof of Theorem 2 


Let 

log(-Dfc)^ dU 

Let Q k denote the principal submatrix corresponding to the first k rows and columns of 
fl Then D k = Thus 

T l °s(D k ) = log(|O fc |) 

k<p k<p 

where 5 P+ 1 = 0. Hence, 

/ exp (w rl :m + E log(|Sitl) dn= 1 

Differentiating both sides w.r.t. and assuming that we can take the derivative inside the 
integral, we get 

E 

k<p 

Since = 2(0 ; 7 1 )ii f° r & 2 max(i,j) and 0 otherwise, we observe that 


dlogflflfcl) / 5 k - 4+i \ \ = Q 
dVtij V 2 )) 




zg(U,S) := / exp 


J-tr(SlU) + J2 | 

k<p 


Uij = E I E ( a k%(h - 4+i) 

\ma x(ij)<fc<p J 

We now rigorously establish the validity of exchanging the derivative and the integral men¬ 
tioned above. Define, 

I Ga ={fie > 0,1 < i < p, and = 0 if (i,j) ^ E} 

We see that Pg ct is an open subset of I Ga and has a positive measure (with respect to induced 
Lebesgue measure on I Ga )• Thus 7r u,s() can also be seen as a density on I Ga taking value 0 
on I Ga — P Ga . We claim that ttu,s() is continuous on I Ga . Clearly, it is enough to prove this 
claim on the boundary between P Ga and I Ga — Pg^- Since Pg ct is open, C L G(t — P Gct . 
Thus for G Bg ct , 7py,<5(D) = 0. 

Let G B^y and (fid 1 )} be a sequence in P^y such that f2 n -> —> Q. We want to show 


3 




Since is positive definite, the above ratio is postie and less than equal to Qj”/ 1 , k+1 . But 
fe+i ^fc+i,fc+i, meaning that the sequence {|nj^i|/|f^|} is positive and bounded 
above. Since hi is not positive definite, either On = 0, or there exists ko such that 
|^fco+i,fco+il 0 and |02 /^q, fcoI ^ 0- 

The exponential term in nu,s(^ n ' > ) is exp (—hr(0^df/)/2) < 1 and in both cases men¬ 
tioned above we have atleast one term outside the exponential converging to 0, while the 
rest are bounded above; which proves that ypy^O^) —> 0. 

Next for O e Pqq we find the partial derivatives and double derivatives of 7r(7^(0). For 
notational convenience we replace ttu,s by 7r() for the rest of this proof. We have already 
seen that for O e P qq, 


= *•(«) -fq + 5 ] (sq%(4 - 4+0 

^ _ 1 <k<p 

Note that for k < max(i,j), = 0. Differentiating the above equation again we get, 

^P = *(n) -u iS + Y. (S4 _1 )«(4 - 4+,) +*(n) ...(*) 

_ l</c<p _ _l<k<p ^ 

For a symmetric non-singular matrix A , 

d (A~% 
dA i:j 


For D in the interior of /qq — Pqq, above partial derivatives and double derivatives exists and 
equals 0. Now, fix i > j and consider D in Bqq. Let us dehne Q h to be a p x p matrix such 
that, Q^ s = k} rs if (r, s) ^ (i,j) and Dh = + h. If Q h E lG a — Pg ct then n( ' n ' > ~ TTl ' n) = 0. 


d \\aw\- 
dAj Lw. 

2\AW\ 2 1 d\AW\ 

\A\^ + \A\^A~ 


... (**) 
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We now show that for h n —> 0, such that Q hn G P G(T ,Vn 


n( y Q hn ) — 7r(fi) n({l hn ) 


—>■ 0. 


h h 

ii n 1 L n 

This will show that for fi G B Gct , exist and equal 0. Since fl hn is positive definite 


h2n > 0. Hence 3 ko such that |fifc 0 _i| > 0 and fifc 0 | = 0. Consider 


1 ^° 1 1 • Note that 

i «n-v 


|fifcg_il —> |fife 0 —11 > 0) |fi^"| —>• |fife 0 | = 0 and jfijij is a quadratic in h, which equals 0 at 
h = 0. Hence |fijM can be written as ah 2 + bh. If 5k 0 > 2, then 


1 1 

( l«fel 


(4 0 /2) 1 

(ah 2 n + bh n y Sk o/2) 

hn 

[K-1 

u 


|(4 0 /2) h n 


K-i 


(S kn / 2 ) 


h { n k ° /2 ^(ahn + bf^M 


—>• 0 


Now note that the other terms in ^ ^ are either ( ^ ) or exp(— tr(Q hn U)/2)] which 

are all positive and bounded above. Thus if 5k > 2, V/c the partial derivatives exist and equal 
0 at B Gct . Before exploring the partial double derivatives we state the following fact, which 
is straightforward to establish. 

Fact. Ifp( fi) is a polynomial in elements of fi G P Gct and, U isp.d. then |p(fi)| exp(— tr(UQ)) 
as a function of fi is uniformly bounded above. 

If 5k > 4, Wk the existence of partial double derivatives on B Gct can proved with the the help 
of the above fact and similar arguments as in the case of single partial derivatives. 

If 7Tjj(fijj) is the marginal density of fi^ then, 


7rp(fiq) 


7r(fi)d(fi \ fi ij) 
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Note that i ly,- has support over whole real line. For arbitrary fiF e M, 


< 

< 

< 


TTij (Q°- + h) - 7 Tij ) f dn (fi) 


h 


da 


-d{a \ a 


13 > 




n(a \ Qij, a° l3 + h ) - 7r(n \ Qij, ) (97r(n) 


h 


h 

h 

h 


d 2 n(a) 




d(n \ a 


13, 


I — ^ij~\~ z 


sup 


d{a \ Qij) [for some z G [0, h]\ 

d 2 n(a) 


3% 


sup 


d(a \ a i3 
d 2 n(a) 


3% 


d.((i \ a 


13) 


...(***) 


If, 

m) 


exp(-tr(nU)/4) \a k \ (5k ~ Sk + l)/2 x 


3 =1 


~Uij +y~~](<hc— Sk+i)(a k i )ij + — s, 


2 l^fe)l 2 , 1 




k-\-l / 


k =1 


fc=l 








and / 2 (H) = exp(—tr(OC7)/4) then d .. K ‘ ) ( L : 2) = / 1 (r2)/ 2 (h2). If <h, > 4, V/c then by the fact above, 

L ij 

fi(a) is uniformly bounded above on P Gct by a constant M > 0. Thus, 


< 


/ sup 

(neP G<T | Qijein^^+h)} 


d 2 7r(a) 




f sup M exp(—tr(au)/4:)d(a\a i j) 

J (oePcJ Qije^^+h)} 


The above integral is finite, which means we can take limit as h —> 0, in (* * *) to obtain, 


d/ilj j (I - i'j ) 

dOh^ j 


dn{a) 
d a i3 


d(a \ f \j) 


Since the support of n^O is the entire real line, J d7Tx ^^ da i3 = 0. Thus J da = 0 
which establishes ★. 
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A.3 Proof of Theorem 5 


If Lj are the independent entries in L then the Jacobian of the transformation from — > 

(. Li,D ), is m =1 D?. Thus the unnornmlized density of (Li,D) is, 

= II D f + "‘ ex P (" \tr(LDL T U) 

3=1 ' 


Since G a is decomposable, and the vertices have been ordered by a perfect elimination 
ordering, 

G F Ga ttle Cc a 


Now, 


p p -1 

tr(LDL T U ) = tr(DL T UL) = ^ D 3 {L T UL) n = ^ D J^UL., + D P U PP 

j =i 3 =i 


where Lj is the j-th column of L. Now L l3 = 0 for i < j, Ljj = 1 and since G a is 
decomposable, for i > j, = 0 if (i,j) ^ E a . Thus, if Lj. denotes the set of independent 
entries in the j-th column of L , then, 


L^UL, 



Un C/J T 

, U J UH 

T l- 



e j) + Cj 


Thus, 

D ) 


D 2 exp 


p - 1 

n 

3 =1 L 

xd p 2 exp ' CpDp 


CnD , 


D, 


^ 'exp 


which implies that {(L^, Dj) p = \, D p } are mutually independent. After straightforward cal¬ 
culations we find that, 
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rs_/ 


N 




Lij \Dj 


-ji 


and 


D j ~ 


Gamma 


D, 


Vi + 5i 


'3 ' u 3 , i £? 
2 ’ 2 


Now for i > j, we calculate the expectation of = ^2 k<J L ik Lj k D k . Note that, 

$k + v k + 2 


^(-Dfc) = 




and for i > j > k, 


E(L ik L jk D k ) 


= E (E(L ik Lj k \D k )D k ) 

— E I I ——-h e k ie k j 


D, 


D k where e k i is the i-th entry of 


= (U yk )^j 1 + e ki e k j 


d k + v k + 2 


Ck 


Similarly for i > j, 


E(LijDj) — e 


Sj + Vj + 2 


j* 


Adding the expectations above gives us the required result, i.e., 

E(Q) = Y^KU^y 1 ] 0 + e T ne 

k<p 


A.4 Proof of Lemma 5 

(a) (b): It follows that for r < k the power of any D r in Li k and Lj k can be 0,-1 
only. Hence the power of D r in Li k Lj k can be —2, —1, 0 only and hence the power of D r in 
L ik L jk D k = L ik L jk D± x D 2 x ... x D k can be -1, 0,1 only. 

(b) =>■ (a): Suppose to the contrary that Property-B does not hold. Then there exist i > j 
with (i,j) ^ E a and r < j such that the maximal negative power of D r in the expansion of 

is at least —2. Let —k denote this power. Then the expansion of EjjDj has at least one 
term where the power of D r is —2k + 1. Since —2k + 1 < —3, this contradicts (b). 



A.5 Proof of Lemma 6 


Firstly, note that if cither of L ik or Lj k is functionally zero, then by assumption L ik i L ]k >D k > 

is functionally non-zero, and we are done. Hence, without loss of generality, we consider a 

situation where both Li k and Lj k are functionally non-zero. 

Case 1: Suppose (i, k) G E a or (j,k) G E a . Then L ik Lj k D k is functionally dependent on 
atleast one of L ik or Lj k , whereas L ik /Lj k iD k t is functionally independent of L ik and 
L jk by (5.4). 

Case 2: Suppose (i, k ) ^ E a and (j, k ) ^ E a . Then by (5.4), each term in the expansion of 
L ik Lj k D k is functionally dependent on D k . However, since k' < k , every term in the 
expansion of L ik iLj k iD k i is functionally independent of D k (by (5.4)). 

A.6 Proof of Lemma 7 

Suppose to the contrary that either Property-A or Property-B is violated. 

Case 1:: Property-A does not hold. 

Let Ej be the first (dependent) entry where it is violated. Thus 3 k < j, s.t. L ik Lj k j^ 
has a square term. Now if any one of them is independent, say Li k , then Lj k has 
the square term because L jk can’t have L ik in it’s expansion. But that violates the 
assumption that is the first term. Hence we have, i > j > k such that, 

(i,j) (£ E & (i,k) £ E & (j, k) £ E 
and Lij ^ 0 & L ik ^ 0 & L ]k ^ 0 


Thus we have violated Generalized Bartlett Property. 

Case 2:: Property-B does not hold. 

Let be the first (dependent) entry where it is violated. Then 3i>j>k>s, 
s.t. the power of D s in L ik Lj k j ^ is < —2. Now if any one (or both) of L ik and Lj k 
is independent, we will get a contradiction since is the first term. Thus we get 
i > j > k s.t., 

(i,j) (£ E & (i,k) (£ E & (j, k) $ E 
and ^ 0 & L ik ^ 0 & L ]k ^ 0 
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contradicting Generalized Bartlett Property. 


A.7 Proof of Theorem 6 

We prove that G satisfies Property-A and Property-B. In particular we will show that for 
every i > j, L t j satisfies Property A and Property B. Consider the following cases. 

Case 1 Suppose i ^ {pi,P2, ■ ■ ■ ,Pr} and j G GraphJbefore(i). 

Here (i, 1), (i, 2),..., (i,j) ^ E, which implies that, 

f'ijl h i.'2 • • • h /./ 0 

Case 2 Suppose i ^ {pi,pz, ■ ■ ■ ,p r } and j G Graph(i), and j < i. 

If (i,j) G E , then L ipj is an independent parameter. Otherwise using Case 1 we get 
that, 

r - -Vf T . - - V T T — 

s<j 3 s<j , s£Graph(j) J 


Since i > j > s and Graph(i) = Graph(j) is a Generalized Bartlett graph, it follows 
from Lemma 7 that L tJ has Property A and Property B. 

Case 3 Suppose i G {pi,P2, ■ ■ ■ ,Pr} and j G GraphJbefore(i ) and j ^ {pi,P2, ■ ■ ■ ,Pr}- 
Since (i,j) E again using Case 1, we get 




Vll—- 

s<j 


'3 


E 


s<j aild s^Graph{j) 


L L 


Now suppose Graph(j) = G k ■ Since (i,Pk- i + 1 ) ^ E, it follows that 

D, 




E is L Pk _ i 


+ l,s 


S<Pfc_l + l 


D 


= 0 


Pfc-i+i 


since by Case 1, L Pk _ 1+ls = 0. Now, we can show inductively for p k _i + 2 ,j; 


Ei,p k _ 1+2 0 ,..., Li j 0 
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Case 4 Suppose i G {pi,p- 2 , ■ ■ ■ ,Pr} and j G Graph(i). If (i, j) G E then Ljj is an indepen¬ 
dent parameter. Otherwise by Case 1, 


Li,j 


L . Rl- 


s<j 


'3 


E 


s<j, s&Graph(j) 


L L — 
^3 


Similar arguments as that in Case 2 can now be used to show that Lij satishes Property 
A and Property B. 


Case 5 Suppose i > j and i,j G {pi,P 2 , ■ ■ ■ ,Pr}- 

If (i,j) G E then L l3 is an independent parameter.If not, 


L . __V],. l — 

^,3 - 


s<j 


3 


Note that every s < j belongs to GraphJbef ore(i). By Case 3, for s {p 1 ,p 2 ,... ,p r }, 
L itS = 0. Hence, 


Lij — 


J2 Li ’ sLj ’ s Dj 

s<j and se{pi,p2,...,p r } 


Since i > j > s belongs to {p\,p 2 , ■ ■ ■ ,p r } and G is a Generalized Bartlett graph it 
follows from Lemma 7 L t] has Property A and Property B. 


A.8 Proof of Theorem 7 


We will prove Property-A and Property-B by considering the following cases. In particular 
we will show that for every k > k! , Lkk’ satishes Property A and Property B. 

Case 1 Suppose k G G ( f\ k’ G G { f, ] for (i,j) ^ and k > A;'. Here k is not a neighbor 

of any of {1, 2,..., k'}. Thus Lkk 1 = 0. 

Case 2 Suppose k > k' G G*f\ If (k,k') G A, then Lkk ' is an independent parameter. 
Otherwise, 


L 


kk’ 


1 

Dk 1 


E 


L hh" L b' ic" D u" 

rv r\j Ay A/ Ay 


k"<k’ 




Lkk"Lk'k"Ek 


k"<k',k"eG\ j) 


Since k > k' > k" belong to the same Generalized Bartlett graph it follows from 
Lemma 7, Lkk' satishes Property A and Property B. 
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Case 3 Suppose k E T and k' E G^\ Then Lkk' is independent. 

Case 4 Suppose k E T and k' E G^ * where k ^ %' . Note that (k, k') ^ E, and hence by 
Case 1, 

1 

\ 

k"<k'k"&G b .' ) 

7 V 

If k\ is the smallest labeled element of G\j \ Lkk t = 0. From here by induction in can 
be proved that Lkk' = 0. 

Cases 1 through 4 prove that for all k > k' , Lkk' satisfies Property A and Property B. 

A.9 Proof of Lemma 10 

We shall prove the Generalized Bartlett property by induction on n. Suppose n = 2. We 
will now express the dependent entries in terms of the independent entries and see whether 
any quadratic terms appear or not. 


E 


Lbb"Lh'h"Dk" 

Ay A, Ay Ay Ay 


Lkk 1 ~ 


D 


k' 


E L 

k"<k' 


hh" L/u'b" Db" 

Ay Ay Ay Ay Ay 


t-< 

CO 

— L 51 — L 61 — 

^62 — 0 


L 42 

- L L Dl 



CO 

- L L Dl 

— — n 4 i^ 3 i — 

— T42T32 

D 2 

— = L 41 L 21 L ; 

-^3 

E53 

D 2 

= —F52T327W, 
-^3 

Lq 4 = 

L >3 

— Lq 3 L 4 3 — 

U 4 


Thus for (n = 2) x 3 grid, Property-A and Property-B is satisfied. Now suppose it is satisfied 
for (n = k) x 3 grid, we want to prove for a (n — k + 1) x 3 grid. 

Again we will express the dependent entries in terms of the independent entries and see 
whether any quadratic terms appear or not. The induction hypothesis implies that Property 
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A and Property B are satisfied for L t] with i < 3k. Note that, 


L?>k+i,i 

Lzk+l,3k-l 

-b 3 fc+l,3fc 

L^k+2,1 

L 3k +2,3k 

L?>k+3,l 

A 3 fc+3,3fc+l 


L 3 fc+1,2 — ••• — L 3 fc+l,3fc-3 ~ 0 

D3k-2 


~L3k+l,3k- 2 L3k-l.3k-2 


-L 


3k+\,3k-2Lzk,3k-2 


—L 


3fc+l,3fc-2-t' 3 fc,3fc-2‘ 


Dzk-i 
D3k-2 

D'Sk 

D3k-2 


~ L3k+l,3k-lL3k,3k-l 


Dsk-i 
~D 


3 k 


D 


3 k 


+ L3k+l,3k-2L3k-l,3k-2Lzk,3k-l- 


D3k-2 


D 


3k 


~I J 3k+2,3k—lL / 3k,3k—l 


[L 3 k, 3 k -2 does not contain L 3k+1M _ 2 \ 

-'3k+2,3k 

Dsk-i 


— L3k+2,3k-3 — L3k+2,3k-2 ~ 0 


D 


3k 


L3k+3,2 — ■ ■ ■ — L 3k +3,3k-3 ~ L3k+3,3k-2 — ^ 3 fc+ 3 , 3 fc-l — 0 

r r D ‘ik 


j 3k+3,3k lj 3k+\,3k 


D 


3fc+l 


L 3 k+i, 3 k does not contain L 3fc+3 , 


3fc I 


Thus a (n — k+ 1) x 3 grid satisfies Property-A. We now proceed to check Property-B. Since 
L 3 k, 3 k -2 and L 3k+13k satisfy Property-B, so does T 3fci3fc _ 2 %and L 3k+ lMjrrT- Hence by 
induction, and Theorem 4, for all n > 2, nx3 grid satishes the Generalized Bartlett property. 

B Comparison with Letac-Massam distributions 

For a decomposable graph G = (V,E), Letac and Massam (2007) generalized the Wishart 
distribution, by defining two classes of distributions with multiple shape parameters on the 
cones Qg and Pc- They are called Type I and Type II Wishart distributions and have proved 
to be useful in high-dimensional Bayesian inference, as shown in Rajaratnam et al. (2008). 
Let Ig denote the set of incomplete p x p matrices X, where X t j is missing iff (i,j) (f E. 
Recall that Qg is defined as 

Qg = {A" G Ig\Xc is positive definite for all cliques C in G}. 

For U G Qg, let U be the unique p x p matrix such that, U l3 = Uij for i — j and (i,j) G E, 
and £/ -1 G Pg- Also for a p x p matrix X, let k(X) be symmetric incomplete matrix such 
that (i, j)-th entry is missing if (i,j) ^ E, and n(X)ij = X i3 otherwise. 

It is natural to compare and contrast our generalized G-Wishart distributions with the 
Letac-Massam distributions when G is decomposable. 
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First, we consider the W^ G family of Type If Wishart distributions (defined on the space 
Pg) in Letac and Massam (2007). In particular, the Wp G a,f3 density on Pg is proportional to 


w^(n) 


OC 


exp (—tr(QU)/ 2) x 


n c « : l(n- , W “ (0,+(0+1)/2 

n S£S l(n- 1 )sh s|WS,+< * +1 >/ 2 )' 


Clearly, the exponential term in the above density is the same as the exponential term in the 
generalized G-Wishart density in (3.1). Now let us compare terms outside the exponential. 
For the generalized G-Wishart density in (3.1), the non-exponential term is 


D 


Si /2 J-.62/2 -p. 5^/2 J--.64/2 
11 ^22 -^33 -^44 


5 


where = LDL T is the modified Cholesky decomposition of Q. The corresponding term for 
Wp G is 

n OT i(«- 1 )d“ (c)+(c+1)/2 

n S£i i(si- i ) s h ,S|(i,,S|+( * +,)/2 >' 

To contrast these two terms, we consider the case when the graph G is the 4-chain, A 4 , given 
by • — • — • — Hence, C\ = {1,2},G2 = {2,3},G3 = {3,4} and S 2 = {3}, S 3 = {4}. It 
follows that 

nLiffl-’iar = ( 1 Y'( 1 ^ r 2 2 Lj 3 y- ft 

nf =2 \Du) \D 22 D 33 D 44 ) 

X / 1 yY 1 1 

\D 22 ) \d 33 d 44 ) \D 33 D 44 ) 

Thus even for this simple graph the non-exponential term for Wp G is very different than the 
corresponding non-exponential term for the generalized G-Wishart. However, if the graph G 
is homogeneous, then Letac and Massam (2007) shows that for any clique G and separator 

<5, 

ieC Uu ieS U%1 

In the homogeneous setting we see that the term outside the exponential is similar to that 
of the generalized G-Wishart. The family of generalized G-Wishart distributions introduced 
in the paper are therefore in general structurally different than the family of Type I and 
Type II Wishart distributions introduced in Letac and Massam (2007). In the special case 
of homogeneous graphs, the family of generalized G-Wishart distributions coincides with the 
family of Type II Wishart distributions in Letac and Massam (2007). 

Next we consider the family of Type I Wishart distributions (defined on the space Qg)> 
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which is refereed to as Wq g in Letac and Massarn (2007). The family of inverse Wishart 
distributions induced by Wq g on the space Pg is referred to as IWq g . In particular, the 
IWZZ* density on Pg is proportional to 

TT „ hO- 1 ')r.| a ( C ) + ( c+ d / 2 

exp U)/ 2) x j-j ^“ | (Q-i)^|^(S)( / 3 (S)+( s +i)/ 2 ) ’ 

where U G Qg and a(C), C G C, /3(S), S G S are real numbers, c = |Cj, s = (S'! and u(S) is 
the multiplicity of the minimal separator S which is positive and independent of the perfect 
order of the cliques considered (as proved by Lauritzen (1996)). Note that, as expected, 
the exponential term in the above density is exp (— tr(Q~ 1 U)/2), whereas the exponential 
term in the generalized G'-Wishart density in (3.1) is exp (— tr(QU)/2). Hence the difference 
between the two classes is fundamental. 


C Model Selection Example 

We now demonstrate through a simulation experiment that the methodology proposed in the 
paper is competitive with standard methods for high-dimensional graphical model selection. 

For a given dimension p, a “true” sparse graph G = ( V , E) with p vertices is chosen 
by taking a simple random sample (without replacement) of size p WW * o.Ol from the 
total number of possible edges. We consider five different values for the number of variables 
p, ranging from p = 50 to p = 1000. The sample size is chosen to be 100, 200 or 300. 
Then, a “true” precision matrix Ho G Pg is generated by taking Ho = LDL T , where D 
is the identity matrix and L,j is a some constant (depending on p) if {i > j, (i,j) e E} 
else Lij = --jj-J2 k< jL ik Lj k D k . Then, n i.i.d. samples from a ^(OjHq 1 ) distribution are 
generated. Let S denote the sample covariance matrix of these n samples. 

The goal now is to estimate the original graph G. Our approach is as follows. We first 
obtain a collection of “good” models (equivalently, graphs) by using the popular penalized 
graphical model selection method Friedman et al. (2007), and then use our Bayesian approach 
to select the best model out of this collection. The Glasso method takes a penalty parameter 
p as an input, and for a given value of p provides a sparse estimate of the inverse covariance 
matrix H. The sparsity pattern in the estimate of H in turn leads to an estimate of the 
underlying graph/model. Banerjee et al. (2008) propose a simple and popular method for 
choosing the penalty parameter p for Glasso, and thereby choosing a graph. 

Our model selection algorithm works in conjunction with Glasso, the penalized likeli¬ 
hood algorithm introduced in Friedman et al. (2007). We shall consider a grid of penalty 
parameters for Glasso and consider models with various levels of sparsity. Before applying 
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the Glasso algorithm we standardize the covariance matrix S. In this case, it is known that 
p = 1 produces extremely sparse models and p close to 0 produces extremely dense models. 
Our penalty parameter grid starts with p = 1 and ends at 0.01 and decreases by steps of 
0.02. For each value of p in this grid, the Glasso algorithm is run to obtain a graph estimate 
with adjacency matrix M p . Graphs with edge density from 0% to 10% are considered. 

For each M p thus obtained, we use the Deviance Information Criterion (DIG) as a mea¬ 
sure of how well the estimated graph/model fits the data. Recall from Mitsakakis et al. 
(2011) that DIC = 2 D — D(Q), where D(Q) — n * (hr (OS') — log(|O|)), D is the posterior 
expectation of D(Q), and O is the posterior expectation of D. Ideally, for each value of 
p, we would like to compute the DIC for the model corresponding to M p . Note however, 
that the graph corresponding to M p may not in general be Generalized Bartlett. Thus we 
generate a Generalized Bartlett cover M cover,p for M p as described in Algorithm 2. If p is 
large, and M cover,p is quite dense, then for computational reasons, we choose a decomposable 
cover M dc,p using the R package “igraph” . Once the appropriate cover has been computed, 
we compute the DIC score corresponding to this cover using hyperparameter values U = cl p 
(where c is the mean of the diagonal entries of riS) and Sj = (Ujj + nSjj)/Sjj for 1 < j < p. 
This DIC score is treated as a measure of goodness of fit for the model corresponding to 
M p . Finally, we choose the M p with the best goodness of fit score. For each value of p, the 
whole process is repeated 20 times, and the average sensitivity and specificity is reported in 
Table 5. For comparison purposes we also report the average sensitivity and specificity of 
the model obtained by using the li penalized likelihood method proposed in Banerjee et al. 
(2008). 

In order to make sure that we are searching for models in a range whose edge density 
includes the true density (1%) we use the starting value of p = 1 (edge density almost 0%) 
and the algorithm ends at value of p with edge density around 7%. Thus the true edge 
density of 1% lies in the range. 

We compare the model selection performance of Glasso (using the approach in Banerjee 
et al. (2008)) and the generalized G-Wishart based Bayesian approach outlined above, in Ta¬ 
ble 5. Both approaches have very high specificity, with the Glasso based approach performing 
slightly better. On the contrary, the generalized G-Wishart approach shows an immense im¬ 
provement in terms of sensitivity as compared to the Glasso approach. This is particularly 
useful in high dimensional biological applications where discovery of an important gene is 
much more important than exclusion of a non-important one. 
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p 

n 

Specificity 

Sensitivity 



Glasso-Ban 

gen. G-Wishart 

Glasso-Ban 

gen. G-Wishart 

50 

100 

1 

0.9830 

0.5833 

1 

100 

100 

1 

0.9714 

0 

0.8663 

200 

100 

1 

0.9316 

0.0007538 

0.7781 

500 

200 

0.9999 

0.9166 

0.0041 

0.5570 

1000 

300 

0.9899 

0.9214 

0.0023 

0.2772 


Table 5: Model selection comparison of Glasso (with penalty parameter chosen by Banerjee 
et al. (2008)) and generalized G-Wishart based Bayesian approach 

D Application to Breast cancer data 

In this section, we use the methodology developed in this paper to analyze a dataset from a 
breast cancer study in Chang et ah (2005). This study is based on n = 248 patients, whose 
expression level of 24481 genes are recorded. As in Khare et ah (2015), we focus on the 
reduced dataset of p — 1107 genes closely associated with breast cancer. The objective is 
to obtain a sparse partial correlation graph, i.e., a sparse estimate of the inverse covariance 
matrix for the 1107 genes, to identify the hub genes. As in Section C, we shall choose can¬ 
didate partial correlation graphs by using penalized likelihood/pseudo-likelihood methods, 
and then choose the best graph by computing the DIG score using the Bayesian methodol¬ 
ogy developed in this paper. The idea is to reduce our search space to a handful of graphs 
and then use the generalized Bartlett methodology developed in this paper for Bayesian 
model selection. To obtain the candidate graphs, we shall use four standard penalized algo¬ 
rithms: SYMLASSO (Friedman et al. (2010)), CONCORD (Khare et al. (2015)), GLASSO 
(Friedman et al. (2007)) and SPACE (Peng et al. (2009)). For each of these algorithms, the 
respective penalty parameters are chosen so that the resulting partial correlation graph has 
100 edges. All the four graphs thus obtained are not Generalized Bartlett, and we obtain 
Generalized Bartlett covers for each of them using Algorithm 2. All of these covers have at 
most 3 extra edges as compared to the original graph. 

Note that each of these four partial correlation (cover) graphs represents a concentra¬ 
tion graph model. Let S denote the sample covariance matrix. Note that, as in Khare 
et al. (2015), each of the p = 1107 data columns were centered and scaled (with respect to 
mean absolute deviation) prior to computing S. For each of the four models, we choose a 
generalized Wishart prior with parameters U and S as 

U = mean(diag(n * S)) * I p + n * S 
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S = mean(diag(U) /diag(S))l p , 

where 1 denotes the p -dimensional vector of all ones. Next, we run the Gibbs sampling 
algorithm for each of these four scenarios for 1000 steps. The resulting Markov chains are 
used to compute the DIG score for each of the four partial correlation graphs (using the 
procedure from Mitsakakis et al. (2011) outlined in Section C). The DIG scores are provided 
in the Table 6, and show that the graph chosen using the CONCORD algorithm performs 
the best. This example illustrates that the methodology developed in this paper can be used 
in conjunction with DIG for high-dimensional graphical model selection in applied settings. 


Algorithm 

DIG 

SYMLASSO 

CONCORD 

GLASSO 

SPACE 

298816.6 

295766.6 

299601.1 

299302.8 


Table 6: Comparison of 4 algorithms 
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